Micro4all workflow

Introduction

This package records the R functions we routinely use in our laboratory for the analysis of amplicon data. The following tutorial shows the main utilities of the package and makes our main workflow public. We will cover several steps, including:

  • Produce an amplicon sequence variant (ASV) table, making use of the dada2 package.
  • Study microbial diversity (\(\alpha\) and \(\beta\) diversity).
  • Get taxonomic profiles at the phylum and genus levels.
  • Differential abundance analysis with ANCOMBC2 package.
  • Ordination graphics and environmental fitting of soil physicochemical parameters.

Dataset

In this tutorial, we will use sequences of our own. Specifically, we collected endosphere samples from 8 olive trees (Picual variety) at three different locations (let’s call them location 1, location 2 and location 3) and sequenced the V3-V4 region of the 16S rRNA gene (with Illumina MiSeq technology) in order to study the bacterial community. These paired-end fastq files (already demultiplexed and without barcodes) can be downloaded here. At this link you can also find a phyloseq object that will be used in the environmental fitting section of this tutorial.

Install package micro4all

micro4all requires some dependencies, such as the microbiome, BiocGenerics and phyloseq packages. We recommend installing them (if not already present) from Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install(c("microbiome", "BiocGenerics", "phyloseq"))

Let’s load the library:

#devtools::install_github("nuriamw/micro4all")
library(micro4all)

Produce an amplicon sequence variant (ASV) table

The first part of this tutorial is mainly based on the information provided by dada2, with our own additions. We will end this section with an ASV table, cleaned and ready for diversity analysis.

Prepare data: libraries and read sequences files

library(dada2); packageVersion("dada2")
#> [1] '1.20.0'
library(ShortRead);packageVersion("ShortRead")
#> [1] '1.50.0'
library(devtools);packageVersion("devtools")
#> [1] '2.4.2'
library(ggplot2);packageVersion("ggplot2")
#> [1] '3.3.5'
library(tidyverse);packageVersion("tidyverse")
#> [1] '1.3.1'
path <- "~/Escritorio/data_micro4all/" # CHANGE ME to the directory containing the fastq files after unzipping.
list.files(path)
#>  [1] "ASV_length_nochim.rds"    "cutadapt"                 "figaro"                  
#>  [4] "filtered"                 "filtN"                    "L1-1_S102_R1_001.fastq"  
#>  [7] "L1-1_S102_R2_001.fastq"   "L1-2_S112_R1_001.fastq"   "L1-2_S112_R2_001.fastq"  
#> [10] "L1-3_S121_R1_001.fastq"   "L1-3_S121_R2_001.fastq"   "L1-4_S130_R1_001.fastq"  
#> [13] "L1-4_S130_R2_001.fastq"   "L1-5_S138_R1_001.fastq"   "L1-5_S138_R2_001.fastq"  
#> [16] "L1-6_S145_R1_001.fastq"   "L1-6_S145_R2_001.fastq"   "L1-7_S152_R1_001.fastq"  
#> [19] "L1-7_S152_R2_001.fastq"   "L1-8_S159_R1_001.fastq"   "L1-8_S159_R2_001.fastq"  
#> [22] "L2-1_S139_R1_001.fastq"   "L2-1_S139_R2_001.fastq"   "L2-2_S146_R1_001.fastq"  
#> [25] "L2-2_S146_R2_001.fastq"   "L2-3_S153_R1_001.fastq"   "L2-3_S153_R2_001.fastq"  
#> [28] "L2-4_S160_R1_001.fastq"   "L2-4_S160_R2_001.fastq"   "L2-5_S104_R1_001.fastq"  
#> [31] "L2-5_S104_R2_001.fastq"   "L2-6_S114_R1_001.fastq"   "L2-6_S114_R2_001.fastq"  
#> [34] "L2-7_S123_R1_001.fastq"   "L2-7_S123_R2_001.fastq"   "L2-8_S132_R1_001.fastq"  
#> [37] "L2-8_S132_R2_001.fastq"   "L3-1_S136_R1_001.fastq"   "L3-1_S136_R2_001.fastq"  
#> [40] "L3-2_S143_R1_001.fastq"   "L3-2_S143_R2_001.fastq"   "L3-3_S150_R1_001.fastq"  
#> [43] "L3-3_S150_R2_001.fastq"   "L3-4_S157_R1_001.fastq"   "L3-4_S157_R2_001.fastq"  
#> [46] "L3-5_S101_R1_001.fastq"   "L3-5_S101_R2_001.fastq"   "L3-6_S111_R1_001.fastq"  
#> [49] "L3-6_S111_R2_001.fastq"   "L3-7_S120_R1_001.fastq"   "L3-7_S120_R2_001.fastq"  
#> [52] "L3-8_S129_R1_001.fastq"   "L3-8_S129_R2_001.fastq"   "MOCK-1_S169_R1_001.fastq"
#> [55] "MOCK-1_S169_R2_001.fastq" "MOCK-2_S176_R1_001.fastq" "MOCK-2_S176_R2_001.fastq"
#> [58] "MOCK-3_S183_R1_001.fastq" "MOCK-3_S183_R2_001.fastq" "nochim"                  
#> [61] "num_seqs.txt"             "rdp_train_set_18_H.fa"    "tabla_otu_inicial.rds"   
#> [64] "taxa_rdp_16S.rds"


## SORT FORWARD AND REVERSE READS SEPARATELY; forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq ##
fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)

📝TIP: If your file names are more complex and you need to remove leading codes, you can always apply gsub over sample.names or call strsplit again.
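For example, with hypothetical file names that carry an extra run prefix, the prefix can be stripped with gsub (fns_demo and the RUN1- prefix are made up for illustration):

```r
# Hypothetical file names with an extra run prefix (illustrative only)
fns_demo <- c("RUN1-L1-1_S102_R1_001.fastq", "RUN1-L2-3_S153_R1_001.fastq")

# Take the first "_"-separated field, then drop the "RUN1-" prefix
names_demo <- sapply(strsplit(basename(fns_demo), "_"), `[`, 1)
names_demo <- gsub("^RUN1-", "", names_demo)
names_demo
#> [1] "L1-1" "L2-3"
```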

Check sequences quality

A crucial step in any sequencing analysis pipeline is quality checking, and we will use several approaches. First, we will generate a table with the number of reads in each sample, just to check that nothing went wrong during the sequencing process.

##COUNT NUMBER OF READS IN EACH SAMPLE BEFORE FILTERING 
raw_reads_count <- NULL

for (i in 1:length(fnFs)){

  raw_reads_count <- rbind(raw_reads_count, length(ShortRead::readFastq(fnFs[i])))}

#Format table
rownames(raw_reads_count)<- sample.names
colnames(raw_reads_count)<- "Number of reads"

#Get min and max number of sequences
min(raw_reads_count)
#> [1] 66426
max(raw_reads_count)
#> [1] 125873

As we can see, even the minimum number of sequences is sufficient for our analysis. We can also check the distribution of sequence lengths with a histogram.

reads <- ShortRead::readFastq(fnFs)

#Get lengths with unique
counts <- NULL
uniques <- unique(reads@quality@quality@ranges@width)

#count number of sequences for each length
for (i in 1:length(uniques)) {
  counts<- rbind(counts,length(which(reads@quality@quality@ranges@width==uniques[i])))

}

#format histogram table
histogram <-  cbind(uniques,counts)
colnames(histogram) <- c("Seq.length", "counts")

#Check histogram matrix
head(histogram[order(histogram[,1],decreasing = TRUE),]) #Most sequences fall in expected sequence length
#>      Seq.length  counts
#> [1,]        301 2154642
#> [2,]        300   71753
#> [3,]        299   16251
#> [4,]        298   61651
#> [5,]        297   10119
#> [6,]        296     377
# PLOT HISTOGRAM
hist(reads@quality@quality@ranges@width, main="Forward length distribution", xlab="Sequence length", ylab="Raw reads")
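The per-length counting loop above can also be collapsed into a single call to base R’s table(). A self-contained sketch on a toy width vector (with the real data, widths would come from ShortRead::width(reads)):

```r
# Toy read lengths standing in for ShortRead::width(reads)
widths <- c(301, 301, 301, 300, 298, 301)

# table() tallies reads per length in one call, like the loop above
hist_demo <- as.data.frame(table(widths))
colnames(hist_demo) <- c("Seq.length", "counts")
```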

ITS considerations

Because of the variable length of the ITS region, this step provides little information when analyzing ITS amplicon sequences.

In addition, the plotQualityProfile function of dada2 is very useful for a quick look at sequence quality profiles.

## VIEW AND SAVE QUALITY PLOT FOR FW AND RV ##
plotQualityProfile(fnFs[1:2])
#> Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.

plotQualityProfile(fnRs[1:2])
#> Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.

Figaro: selection of trimming parameters

In this tutorial, we will use the Figaro tool, which makes the selection of trimming parameters for dada2 more objective. Sadly, Figaro has not yet been updated to deal with variation in sequence lengths. We will work around this following the author’s recommendation (which you can find here): performing a pre-trimming step at about 295 bp. Let’s see how it works!

## FIGARO
##FILTER READS
figFs <- file.path(path, "figaro", basename(fnFs))
figRs <- file.path(path, "figaro", basename(fnRs))

## TRIMMING AT 295 bp
out.figaro <- filterAndTrim(fnFs, figFs, fnRs, figRs,
                     compress=TRUE, multithread=TRUE, truncLen=c(295,295)) #

##RUN FIGARO
figaro <- system(("python3 /home/programs/figaro/figaro/figaro.py -i ~/Escritorio/data_micro4all/figaro -o ~/Escritorio/data_micro4all/figaro -a 426 -f 17 -r 21"),
                 intern=TRUE) # path to the figaro program; -i input directory, -o output directory, -a amplicon length without primers, -f forward primer length, -r reverse primer length


head(figaro)
#> [1] "Forward read length: 295"                                                                                                   
#> [2] "Reverse read length: 295"                                                                                                   
#> [3] "{\"trimPosition\": [279, 205], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.69, \"score\": 76.68937485688116}"
#> [4] "{\"trimPosition\": [278, 206], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.68, \"score\": 76.6750629722922}" 
#> [5] "{\"trimPosition\": [277, 207], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.67, \"score\": 76.66838409281735}"
#> [6] "{\"trimPosition\": [280, 204], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.66, \"score\": 76.6617052133425}"

We will select the first result, which gives the trimming positions and maximum expected errors that produce the highest score according to Figaro’s model.
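Since Figaro prints each candidate as a JSON string, the chosen trimming positions can also be extracted programmatically rather than copied by hand. A base-R sketch, using the first output line shown above (the jsonlite package could parse the full record if you prefer proper JSON handling):

```r
# A line of Figaro output as shown above (copied verbatim)
best <- "{\"trimPosition\": [279, 205], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.69, \"score\": 76.68937485688116}"

# Extract the two trimming positions with a regular expression
m <- regmatches(best, regexpr("\"trimPosition\": \\[[0-9]+, [0-9]+\\]", best))
trim <- as.numeric(regmatches(m, gregexpr("[0-9]+", m))[[1]])
trim
#> [1] 279 205
```

These are the values we pass to truncLen in the next section.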

ITS considerations

As mentioned before, the ITS region has a very variable length. Consequently, no trimming step is applied to these sequences, and Figaro is not needed.

Filter and trim sequences

We will pass Figaro’s parameters to the filterAndTrim function:

### FILTER AND TRIM SEQUENCES ACCORDING TO FIGARO ####

## Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", basename(fnFs))
filtRs <- file.path(path, "filtered", basename(fnRs))

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(279,205),
                     maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE,
                     compress=TRUE, multithread=TRUE, minLen=50)


head(out)
#>                        reads.in reads.out
#> L1-1_S102_R1_001.fastq    80054     61741
#> L1-2_S112_R1_001.fastq    76776     60781
#> L1-3_S121_R1_001.fastq    84947     67563
#> L1-4_S130_R1_001.fastq    87190     68997
#> L1-5_S138_R1_001.fastq   114810     93159
#> L1-6_S145_R1_001.fastq    88519     70075

Cutadapt: efficient removal of primers

Now that we have run Figaro and performed quality trimming, we can move on to primer removal. This will be done with the cutadapt tool. Cutadapt should not be applied before Figaro, because it can modify the quality profiles on which Figaro relies.

#### IDENTIFY PRIMERS ####

FWD <- "CCTACGGGNBGCASCAG"  ## CHANGE ME to your forward primer sequence
REV <- "GACTACNVGGGTATCTAATCC"  ## CHANGE ME to your reverse primer sequence

## VERIFY PRESENCE AND ORIENTATION OF PRIMERS ##
allOrients <- function(primer) {
  # Create all orientations of the input sequence
  require(Biostrings)
  dna <- DNAString(primer)  # Biostrings works with DNAString objects rather than character vectors
  orients <- c(Forward = dna, Complement = Biostrings::complement(dna), Reverse = reverse(dna),
               RevComp = reverseComplement(dna))
  return(sapply(orients, toString))  # Convert back to character vector
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
FWD.orients
#>             Forward          Complement             Reverse             RevComp 
#> "CCTACGGGNBGCASCAG" "GGATGCCCNVCGTSGTC" "GACSACGBNGGGCATCC" "CTGSTGCVNCCCGTAGG"
REV.orients
#>                 Forward              Complement                 Reverse                 RevComp 
#> "GACTACNVGGGTATCTAATCC" "CTGATGNBCCCATAGATTAGG" "CCTAATCTATGGGVNCATCAG" "GGATTAGATACCCBNGTAGTC"
## COUNT THE APPEARANCE AND ORIENTATION OF PRIMERS ##
primerHits <- function(primer, fn) {
  # Counts number of reads in which the primer is found
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = filtFs[[1]]),
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = filtRs[[1]]),
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = filtFs[[1]]),
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = filtRs[[1]]))
#>                  Forward Complement Reverse RevComp
#> FWD.ForwardReads   59926          0       0       0
#> FWD.ReverseReads       0          0       0       0
#> REV.ForwardReads       2          0       0       0
#> REV.ReverseReads   60270          0       0       0

At this point, and as explained in the dada2 ITS tutorial, we find what we expected: the FWD primer appears in the forward reads in its forward orientation, but it also shows up in some reverse reads in its reverse-complement orientation. Something similar occurs with the reverse primer.

Now, let’s run cutadapt. We use the ^ character to anchor the primers and combine it with the --discard-untrimmed argument. This way, a read pair is not retained if only the reverse primer is found: both forward and reverse primers must be present for the sequences to be kept.

cutadapt <- "/usr/local/bin/cutadapt" #path to cutadapt 

system2(cutadapt, args = c("--version")) # Run shell commands from R

##Create path to cutadapt sequences
path.cut <- file.path(path, "cutadapt") 
if(!dir.exists(path.cut)) dir.create(path.cut)

fnFs.cut <- file.path(path.cut, basename(filtFs))
fnRs.cut <- file.path(path.cut, basename(filtRs))


##Produce arguments for cutadapt
FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)

# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste0("-a", " ", "^",FWD,"...", REV.RC) 

# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste0("-A"," ","^", REV, "...", FWD.RC)


# Run Cutadapt

for(i in seq_along(fnFs)) {
  system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2,"-m", 1, # -n 2 required to remove FWD and REV from reads
                             #-m 1 is required to remove empty sequences for plotting quality plots
                             "--discard-untrimmed",
                             "-j",0,#automatically detect number of cores
                             "-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
                             filtFs[i], filtRs[i],# input files
                             "--report=minimal")) #Report minimal reports a summary

}

Let’s check whether cutadapt has properly removed the primers:

rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]),
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]),
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]),
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))
#>                  Forward Complement Reverse RevComp
#> FWD.ForwardReads       1          0       0       0
#> FWD.ReverseReads       0          0       0       0
#> REV.ForwardReads       0          0       0       0
#> REV.ReverseReads       0          0       0       0
cutFs <- sort(list.files(path.cut, pattern = "_R1_001.fastq", full.names = TRUE))
cutRs <- sort(list.files(path.cut, pattern = "_R2_001.fastq", full.names = TRUE))

As we can see, a forward primer hit still shows up in the forward reads. We should not worry about that: cutadapt and primerHits search for primers in different ways, and if primers remain in low-quality sequences, they will most likely be removed in the next steps.

Next, we will apply the core of the DADA2 package. As it is very well documented in the dada2 tutorial, we will not comment on it at length.

Learn error rates

#### LEARN ERROR RATES ####
errF <- learnErrors(fnFs.cut, multithread=T, verbose=1 ) #
#> 109967023 total bases in 419736 reads from 6 samples will be used for learning the error rates.
#> Initializing error rates to maximum possible estimate.
#> selfConsist step 1 ......
#>    selfConsist step 2
#>    selfConsist step 3
#>    selfConsist step 4
#>    selfConsist step 5
#> Convergence after  5  rounds.
errR <- learnErrors(fnRs.cut, multithread=T, verbose=1) #
#> 102040114 total bases in 554558 reads from 8 samples will be used for learning the error rates.
#> Initializing error rates to maximum possible estimate.
#> selfConsist step 1 ........
#>    selfConsist step 2
#>    selfConsist step 3
#>    selfConsist step 4
#>    selfConsist step 5
#>    selfConsist step 6
#> Convergence after  6  rounds.

View error plots

As we can see, the estimated error rates fit the observed rates adequately.

plotErrors(errF, nominalQ=TRUE)
#> Warning: Transformation introduced infinite values in continuous y-axis

#> Warning: Transformation introduced infinite values in continuous y-axis

plotErrors(errR, nominalQ=TRUE)
#> Warning: Transformation introduced infinite values in continuous y-axis

#> Warning: Transformation introduced infinite values in continuous y-axis

Sample inference

dadaFs <- dada(fnFs.cut, err=errF, multithread=TRUE)
#> Sample 1 - 61506 reads in 15424 unique sequences.
#> Sample 2 - 60287 reads in 12454 unique sequences.
#> Sample 3 - 67108 reads in 13471 unique sequences.
#> Sample 4 - 68541 reads in 13392 unique sequences.
#> Sample 5 - 92672 reads in 19174 unique sequences.
#> Sample 6 - 69622 reads in 14664 unique sequences.
#> Sample 7 - 66191 reads in 14858 unique sequences.
#> Sample 8 - 68631 reads in 17064 unique sequences.
#> Sample 9 - 64360 reads in 12997 unique sequences.
#> Sample 10 - 73537 reads in 16302 unique sequences.
#> Sample 11 - 64206 reads in 12105 unique sequences.
#> Sample 12 - 74353 reads in 16945 unique sequences.
#> Sample 13 - 62585 reads in 14748 unique sequences.
#> Sample 14 - 72255 reads in 16401 unique sequences.
#> Sample 15 - 69370 reads in 15705 unique sequences.
#> Sample 16 - 56068 reads in 13202 unique sequences.
#> Sample 17 - 73197 reads in 14786 unique sequences.
#> Sample 18 - 58236 reads in 12100 unique sequences.
#> Sample 19 - 56801 reads in 13257 unique sequences.
#> Sample 20 - 56072 reads in 12635 unique sequences.
#> Sample 21 - 57847 reads in 12828 unique sequences.
#> Sample 22 - 65716 reads in 11388 unique sequences.
#> Sample 23 - 69528 reads in 13095 unique sequences.
#> Sample 24 - 52479 reads in 9714 unique sequences.
#> Sample 25 - 89584 reads in 7824 unique sequences.
#> Sample 26 - 93427 reads in 8209 unique sequences.
#> Sample 27 - 106500 reads in 9724 unique sequences.
dadaRs <- dada(fnRs.cut, err=errR, multithread=TRUE)
#> Sample 1 - 61506 reads in 15090 unique sequences.
#> Sample 2 - 60287 reads in 8972 unique sequences.
#> Sample 3 - 67108 reads in 10240 unique sequences.
#> Sample 4 - 68541 reads in 11098 unique sequences.
#> Sample 5 - 92672 reads in 15470 unique sequences.
#> Sample 6 - 69622 reads in 11810 unique sequences.
#> Sample 7 - 66191 reads in 10626 unique sequences.
#> Sample 8 - 68631 reads in 13787 unique sequences.
#> Sample 9 - 64360 reads in 9862 unique sequences.
#> Sample 10 - 73537 reads in 11339 unique sequences.
#> Sample 11 - 64206 reads in 8219 unique sequences.
#> Sample 12 - 74353 reads in 12900 unique sequences.
#> Sample 13 - 62585 reads in 11369 unique sequences.
#> Sample 14 - 72255 reads in 12843 unique sequences.
#> Sample 15 - 69370 reads in 12557 unique sequences.
#> Sample 16 - 56068 reads in 11103 unique sequences.
#> Sample 17 - 73197 reads in 12138 unique sequences.
#> Sample 18 - 58236 reads in 9899 unique sequences.
#> Sample 19 - 56801 reads in 9540 unique sequences.
#> Sample 20 - 56072 reads in 10430 unique sequences.
#> Sample 21 - 57847 reads in 9716 unique sequences.
#> Sample 22 - 65716 reads in 9249 unique sequences.
#> Sample 23 - 69528 reads in 11676 unique sequences.
#> Sample 24 - 52479 reads in 8641 unique sequences.
#> Sample 25 - 89584 reads in 6378 unique sequences.
#> Sample 26 - 93427 reads in 6198 unique sequences.
#> Sample 27 - 106500 reads in 7362 unique sequences.
dadaFs[[1]]
#> dada-class: object describing DADA2 denoising results
#> 758 sequence variants were inferred from 15424 input unique sequences.
#> Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
names(dadaFs) <- sample.names
names(dadaRs) <- sample.names

Merge paired-end reads

mergers <- mergePairs(dadaFs, fnFs.cut, dadaRs, fnRs.cut, verbose=TRUE)
#> 58056 paired-reads (in 841 unique pairings) successfully merged out of 59510 (in 1465 pairings) input.
#> 57884 paired-reads (in 527 unique pairings) successfully merged out of 58969 (in 973 pairings) input.
#> 65180 paired-reads (in 492 unique pairings) successfully merged out of 66042 (in 881 pairings) input.
#> 66703 paired-reads (in 486 unique pairings) successfully merged out of 67552 (in 865 pairings) input.
#> 89414 paired-reads (in 932 unique pairings) successfully merged out of 90777 (in 1561 pairings) input.
#> 66897 paired-reads (in 665 unique pairings) successfully merged out of 68102 (in 1223 pairings) input.
#> 63906 paired-reads (in 609 unique pairings) successfully merged out of 64927 (in 1106 pairings) input.
#> 65300 paired-reads (in 860 unique pairings) successfully merged out of 66704 (in 1524 pairings) input.
#> 62656 paired-reads (in 606 unique pairings) successfully merged out of 63300 (in 889 pairings) input.
#> 71291 paired-reads (in 635 unique pairings) successfully merged out of 72263 (in 1082 pairings) input.
#> 62506 paired-reads (in 443 unique pairings) successfully merged out of 63204 (in 757 pairings) input.
#> 71825 paired-reads (in 733 unique pairings) successfully merged out of 72805 (in 1226 pairings) input.
#> 59867 paired-reads (in 626 unique pairings) successfully merged out of 61145 (in 1295 pairings) input.
#> 69397 paired-reads (in 807 unique pairings) successfully merged out of 70680 (in 1361 pairings) input.
#> 66800 paired-reads (in 773 unique pairings) successfully merged out of 67865 (in 1198 pairings) input.
#> 53498 paired-reads (in 691 unique pairings) successfully merged out of 54585 (in 1149 pairings) input.
#> 70984 paired-reads (in 668 unique pairings) successfully merged out of 71999 (in 1094 pairings) input.
#> 56672 paired-reads (in 449 unique pairings) successfully merged out of 57345 (in 744 pairings) input.
#> 54758 paired-reads (in 705 unique pairings) successfully merged out of 55509 (in 1006 pairings) input.
#> 54202 paired-reads (in 648 unique pairings) successfully merged out of 54911 (in 986 pairings) input.
#> 55901 paired-reads (in 585 unique pairings) successfully merged out of 56778 (in 1000 pairings) input.
#> 64397 paired-reads (in 475 unique pairings) successfully merged out of 65119 (in 738 pairings) input.
#> 67867 paired-reads (in 532 unique pairings) successfully merged out of 68681 (in 897 pairings) input.
#> 51487 paired-reads (in 353 unique pairings) successfully merged out of 51953 (in 540 pairings) input.
#> 89364 paired-reads (in 74 unique pairings) successfully merged out of 89388 (in 85 pairings) input.
#> 93212 paired-reads (in 82 unique pairings) successfully merged out of 93267 (in 96 pairings) input.
#> 106318 paired-reads (in 139 unique pairings) successfully merged out of 106393 (in 160 pairings) input.
# Inspect the merger data.frame from the first sample
head(mergers[[1]])
#>                                                                                                                                                                                                                                                                                                                                                                                                                    sequence
#> 1   TGGGGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGAAGAAGCGAAAGTGACGGTACCTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTCTGTCGCGTCGGATGTGAAAGCCCGGGGCTTAACCCCGGGTCTGCATTCGATACGGGCAGACTAGAGTGTGGTAGGGGAGATCGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGATCTCTGGGCCATTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 2 TGGGGAATATTGCGCAATGGGCGAAAGCCTGACGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCGCCAGGGACGAAGCTTTCGGGTGACGGTACCTGGAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGTCGGCCGTGAAAACTTCACGCTTAACGTGGAGCTTGCGGTCGATACGGGCAGACTTGAGTTCGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCCGATACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 3   TGGGGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGAAGAAGCGAAAGTGACGGTACCTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTCTGTCACGTCGGATGTGAAAGCCCGGGGCTTAACCCCGGGTCTGCATTCGATACGGGCAGACTAGAGTGTGGTAGGGGAGATCGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGATCTCTGGGCCATTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 4   TGGGGAATATTGCGCAATGGGCGAAAGCCTGACGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCGCCAGGGACGAAGGGTGACTGACGGTACCTGGAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGTCGGCCGTGAAAACTTCACGCTTAACGTGGAGCTTGCGGTCGATACGGGCAGACTTGAGTTCGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCCGATACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 5   TGGGGAATATTGCGCAATGGGCGGAAGCCTGACGCAGCGACGCCGCGTGGGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCTAACGTGACGGTACCTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGTCCGTCGTGAAAGCCCACGGCTTAACTGTGGGTCTGCGGTGGATACGGGCAGACTAGAGGCAGGTAGGGGAGCATGGAATTCCCGGTGTAGCGGTGAAATGCGCAGATATCGGGAGGAACACCGGTGGCGAAGGCGGTGCTCTGGGCCTGTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 6      TGGGGAATCTTGGACAATGGGCGAAAGCCCGATCCAGCAATATCGCGTGAGTGAAGAAGGGCAATGCCGCTTGTAAAGCTCTTTCATCGAGTGCGCGATCATGACAAGACTCGAGGAAGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGGGGGGGCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAACTGGCGGAATGCTCTCGAAACCAATTCACTTGAGTGAGACAGAGGAGAGTGGAATTTCGTGTGTAGGGGTGAAATCCGGAGATCTACGAAGGAACGCCAAAAGCGAAGGCAGCTCTCTGGGTCCCTACCGACGCTGGGGTGCGAAAGCATGGGGAGCGAACA
#>   abundance forward reverse nmatch nmismatch nindel prefer accept
#> 1      6997       1       1     39         0      0      2   TRUE
#> 2      5696       2       2     37         0      0      2   TRUE
#> 3      3738       3       1     39         0      0      2   TRUE
#> 4      2282       4       2     39         0      0      2   TRUE
#> 5      1950       5       5     39         0      0      2   TRUE
#> 6      1931       6       3     42         0      0      2   TRUE

Construct sequence table

seqtab <- makeSequenceTable(mergers)
dim(seqtab) 
#> [1]   27 7080

Remove chimeras

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
#> Identified 2163 bimeras out of 7080 input sequences.
dim(seqtab.nochim)
#> [1]   27 4917

Inspect ASV length and abundance

table(nchar(getSequences(seqtab.nochim)))  #Number of ASV of each length
#> 
#>  244  247  253  255  259  262  264  269  273  276  299  304  311  316  321  327  338  339  356 
#>    2    2    1    1    1    4    1    1    1    1    1    1    1    1    1    1    1    1    1 
#>  359  361  364  369  370  373  376  380  382  384  385  386  390  396  397  400  401  402  403 
#>    1    1    2    1    1    1    1    1    2    2    6    8    1    5    2    4   44  984  294 
#>  404  405  406  407  408  409  410  411  412  413  414  415  416  417  418  419  420  421  422 
#>  202  156   20  387   82   44   24   20    3   11    7    5   50   16   18   32   27  216  552 
#>  423  424  425  426  427  428  429  430  431  432 
#>   40   29   57  265 1026  218   14    3    2    6

This is useful, but we also want to know the total read abundance at each length. Let’s calculate it and make a plot.

reads.per.seqlen <- tapply(colSums(seqtab.nochim), factor(nchar(getSequences(seqtab.nochim))), sum) #number of sequences for each length
reads.per.seqlen
#>    244    247    253    255    259    262    264    269    273    276    299    304    311 
#>      4    104      2      4      2    175      6      3      2      2      2      4      2 
#>    316    321    327    338    339    356    359    361    364    369    370    373    376 
#>      2      2      4      2      2      2      2      4     50      4      2      6      4 
#>    380    382    384    385    386    390    396    397    400    401    402    403    404 
#>      4  36451      8     40   1082      2     15     13    661    504 130086  99017 277172 
#>    405    406    407    408    409    410    411    412    413    414    415    416    417 
#>   3998    995 592438 116044  58285   1601   1328   4790    179   2455     93   2878    133 
#>    418    419    420    421    422    423    424    425    426    427    428    429    430 
#>    612   5778   1023  12092  29632    268    448   4098  12680 370689  13647    119      8 
#>    431    432 
#>      4    213
## Plot length distribution
table_reads_seqlen <- data.frame(length=as.numeric(names(reads.per.seqlen)), count=reads.per.seqlen)

ggplot(data=table_reads_seqlen, aes(x=length, y=count)) + geom_col()

Based on this distribution, we will keep the interval ranging from 402 to 428 bp.

seqtab.nochim <- seqtab.nochim[,nchar(colnames(seqtab.nochim)) %in% seq(402,428)]

Check number of reads

At this point, it is important to check how many sequences we have lost at each step. Apart from filtering, no other step should have caused a substantial decrease in sequence numbers.

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names


#Let's compute percentages with mutate, making the table easier to interpret.
track <- track %>%as.data.frame() %>% mutate(Perc.input=filtered*100/input,
                                             Perc.denoisedF= denoisedF*100/filtered,
                                             Perc.denoisedR= denoisedR*100/filtered,
                                             Perc.merged= merged*100/filtered,
                                             Perc.nonchim= nonchim*100/merged)

head(track)
#>       input filtered denoisedF denoisedR merged nonchim Perc.input Perc.denoisedF
#> L1-1  80054    61741     60003     60662  58056   55455   77.12419       97.18501
#> L1-2  76776    60781     59308     59694  57884   56153   79.16667       97.57655
#> L1-3  84947    67563     66299     66670  65180   63021   79.53548       98.12915
#> L1-4  87190    68997     67828     68092  66703   63831   79.13408       98.30572
#> L1-5 114810    93159     91321     91801  89414   84491   81.14189       98.02703
#> L1-6  88519    70075     68479     68988  66897   64905   79.16380       97.72244
#>      Perc.denoisedR Perc.merged Perc.nonchim
#> L1-1       98.25238    94.03152     95.51984
#> L1-2       98.21161    95.23371     97.00954
#> L1-3       98.67827    96.47292     96.68763
#> L1-4       98.68835    96.67522     95.69435
#> L1-5       98.54228    95.97999     94.49415
#> L1-6       98.44880    95.46486     97.02229

Everything is as expected, so let’s move on to classification!

Taxonomy classification with RDP

taxa_rdp <- assignTaxonomy(seqtab.nochim, "~/Escritorio/data_micro4all/rdp_train_set_18_H.fa", multithread=TRUE)

Get ASV table

We will produce a raw ASV table that can be saved as a .txt file. This allows us to manipulate the data manually if needed, as well as to keep a copy of the raw data before further steps are performed.

ASV <- seqtab.nochim
ASVt <- t(ASV)

## SUBSTITUTE 'NA' WITH 'unclassified' AND REMOVE THE SPECIES COLUMN
taxa_rdp_na <- apply(taxa_rdp,2, tidyr::replace_na, "unclassified")[,-7]

##Create names for each ASV: if there are 1000 ASVs, call them from ASV0001 to ASV1000
number.digit<- nchar(as.integer(nrow(ASVt)))
names <- paste0("ASV%0", number.digit, "d") #As many 0 as digits
ASV_names<- sprintf(names, 1:nrow(ASVt))

## CREATE AND SAVE ASV TABLE
ASV_table_classified_raw<- cbind(as.data.frame(taxa_rdp_na,stringsAsFactors = FALSE),as.data.frame(ASV_names, stringsAsFactors = FALSE),as.data.frame(ASVt,stringsAsFactors = FALSE))

##Make rownames a new column, in order to keep sequences during the filtering process
ASV_seqs <- rownames(ASV_table_classified_raw)
rownames(ASV_table_classified_raw) <- NULL
ASV_table_classified_raw <- cbind(ASV_seqs, ASV_table_classified_raw)
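As mentioned above, this raw table can be written to a tab-delimited .txt file for manual inspection outside R (the file name here is only an example):

```r
## Save the raw classified ASV table as a tab-delimited text file
## (row names are omitted because the sequences are already stored
## in the ASV_seqs column)
write.table(ASV_table_classified_raw, file = "ASV_table_raw.txt",
            sep = "\t", quote = FALSE, row.names = FALSE)
```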

MOCK community

In order to filter out spurious ASVs, we always include in our sequencing runs three samples of a commercial MOCK community (ZymoBIOMICS Microbial Community Standard II (Log Distribution), ZYMORESEARCH, CA, USA). Because we already know its composition, we can check whether the sequencing process identified any ASV that is not a real member of the community. Once such an ASV is identified, we calculate the percentage of sequences it represents. We assume that ASVs representing less than this percentage should be discarded from our data. To automate this analysis, micro4all implements a function called MockCommunity. Let’s see how it works:

ASV_filtered_sorted<-MockCommunity(ASV_table_classified_raw, mock_composition, ASV_column = "ASV_names") #mock_composition is a data frame included as data in this package. See documentation for details.
#> ASV0811 does not belong to the MOCK community. It representes a  0.031757  perc. of the sequences 
#>  and it classifies as Limosilactobacillus 
#>  Do you want to use this ASV to calculate the percentage? [answer yes or no]ASV0002 does not belong to the MOCK community. It representes a  0.003881  perc. of the sequences 
#>  and it classifies as Streptomyces 
#>  Do you want to use this ASV to calculate the percentage? [answer yes or no]You made a decision! ASV0002 which classifies as Streptomyces 
#>  and represents a 0.003881 perc. of the sequences, was used to calculate the percentage

MockCommunity asks the user a question before making the decision. Why? Sometimes, as in the example above, an ASV may be classified under a different name while still belonging to the same genus. In this example, the first ASV not found in the MOCK community composition table is classified as Limosilactobacillus. However, we can see that it represents a high percentage of the sequences: it is most likely a Lactobacillus from the MOCK samples (Limosilactobacillus was split off from the genus Lactobacillus). Because of that, we decide to choose the second option.

Once the percentage filter is applied, we can remove sequences coming from chloroplasts or mitochondria.

ASV_final_table <- ASV_filtered_sorted[which(ASV_filtered_sorted$Genus != "Streptophyta"
                                             & ASV_filtered_sorted$Genus != "Chlorophyta"
                                             & ASV_filtered_sorted$Genus != "Bacillariophyta"
                                             & ASV_filtered_sorted$Family != "Streptophyta"
                                             & ASV_filtered_sorted$Family != "Chlorophyta"
                                             & ASV_filtered_sorted$Family != "Bacillariophyta"
                                             & ASV_filtered_sorted$Family != "Mitochondria"
                                             & ASV_filtered_sorted$Class != "Chloroplast"
                                             & ASV_filtered_sorted$Order != "Chloroplast"
                                             & ASV_filtered_sorted$Kingdom != "Eukaryota"
                                             & ASV_filtered_sorted$Kingdom != "unclassified"), ]

After this, we check the presence of Cyanobacteria/Chloroplast at phylum level. In our data there were only two ASVs classified as Cyanobacteria/Chloroplast at phylum level that were not classified at deeper levels, so we remove them as follows:

ASV_final_table <- ASV_final_table[which(ASV_final_table$Phylum != "Cyanobacteria/Chloroplast"), ]

Diversity analysis

We almost have all the components needed for a full diversity analysis. For this purpose, we need a phyloseq object. Before building it, we will produce a phylogenetic tree, which will allow us to use UniFrac and Weighted UniFrac distances later on. We will use external programs such as MAFFT5 and FastTreeMP6.

library(seqinr)
## GET THE FASTA FILE ##
ASVseqs_MOCK <- as.list(ASV_final_table$ASV_seqs)
write.fasta(ASVseqs_MOCK, names=ASV_final_table$ASV_names, file.out="ASV_tree.fas", open = "w", nbchar =1000 , as.string = FALSE)

##Get the alignment ##
mafft <- "/usr/bin/mafft"     #path to program

system2(mafft, args="--version")
system2(mafft, args=c("--auto", "ASV_tree.fas"), stdout="alignment") # MAFFT writes the alignment to stdout, captured in the file "alignment"

## Get the tree ##
FastTreeMP <- "/home/programs/FastTreeMP/FastTreeMP"

system2(FastTreeMP, args="--version" )
system2(FastTreeMP, args = c("-gamma", "-nt", "-gtr", "-spr", 4, "-mlacc", 2, "-slownni"),
        stdin = "alignment", stdout = "tree") # Run FastTreeMP, reading the alignment and writing the tree

##Introduce phylogenetic tree

tree <-  phyloseq::read_tree("tree")

ITS considerations

The ITS region is not appropriate for building phylogenetic trees. As a result, UniFrac and Weighted UniFrac distances cannot be applied to ITS data.
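In that case, the phyloseq object would simply be built without the tree component (a sketch, assuming the same phy_OTUtable, phy_taxonomy, phy_metadata and dna objects created in the next section); non-phylogenetic distances such as Bray-Curtis remain available:

```r
## Sketch: phyloseq object for ITS data, omitting the phylogenetic tree
ITS_phyloseq <- phyloseq::phyloseq(phy_OTUtable, phy_taxonomy,
                                   phy_metadata, dna)
```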

Create phyloseq

We will put all the information we have (classification, abundance, metadata, ASV sequences and phylogenetic tree) into one object thanks to phyloseq7. We get the metadata table from the micro4all package, where it is stored as data; it simply consists of a data frame with sample names and locations. Later on, this will allow us to group samples by location.

##Load packages
library(phyloseq); packageVersion("phyloseq")
#> [1] '1.36.0'
library(Biostrings); packageVersion("Biostrings")
#> [1] '2.60.2'
library(GUniFrac); packageVersion("GUniFrac")
#> [1] '1.3'
library(phangorn); packageVersion("phangorn")
#> [1] '2.7.1'
library(vegan); packageVersion("vegan")
#> [1] '2.5.7'
library(pheatmap); packageVersion("pheatmap")
#> [1] '1.0.12'
library(colorspace); packageVersion("colorspace")
#> [1] '2.0.2'
library(stringr)  # for str_replace_all used below
library(ggplot2)  # for the plots later on
####CREATE PHYLOSEQ OBJECT FROM TABLES
##Get tax, otu and sequences
tax <- ASV_final_table[,2:8] #Tax

OTU <-  ASV_final_table[,9:ncol(ASV_final_table)] #ASV
colnames(OTU) <- str_replace_all(colnames(OTU),"-", "_")

dna<-Biostrings::DNAStringSet(ASV_final_table$ASV_seqs) #Sequences
names(dna)<- ASV_final_table$ASV_names

##ADD ASV NAMES
row.names(tax)<-ASV_final_table$ASV_names
row.names(OTU)<-ASV_final_table$ASV_names
#Check rownames are equal
identical(rownames(OTU), rownames(tax))
#> [1] TRUE
##CONVERT TO PHYLOSEQ FORMAT
metadata.micro <-micro4all::metadata
metadata.micro$samples <- str_replace_all(metadata.micro$samples,"-", "_")
rownames(metadata.micro) <- str_replace_all(rownames(metadata.micro),"-", "_")
phy_OTUtable<-otu_table(OTU, taxa_are_rows = T)
phy_taxonomy<-tax_table(as.matrix(tax))
phy_metadata<-sample_data(metadata.micro)

##Introduce phylogenetic tree
unrooted_tree<- tree
is.rooted(unrooted_tree)
#> [1] FALSE
##Produce root
tree_root<-root(unrooted_tree, 1, resolve.root = T)
tree_root
#> 
#> Phylogenetic tree with 1035 tips and 1034 internal nodes.
#> 
#> Tip labels:
#>   ASV0552, ASV0148, ASV1054, ASV0044, ASV0438, ASV0153, ...
#> Node labels:
#>   Root, , 0.000, 0.000, 0.886, 0.807, ...
#> 
#> Rooted; includes branch lengths.
is.rooted(tree_root)
#> [1] TRUE
##Add tree to phyloseq
location_phyloseq<-phyloseq(phy_OTUtable,phy_taxonomy,phy_metadata,dna,tree_root)

\(\alpha\) diversity

Data preparation

First, we will produce a table with \(\alpha\) diversity indices. For this purpose, we will rarefy the data to the lowest number of sequences per sample. This avoids overestimating the diversity of a sample simply because it has more sequences.

Rarefaction curves should be produced first. They make it easy to see whether the rarefaction process discards much information. The plot represents the number of ASVs (richness) against the number of sequences, and a vertical line marks the number of sequences at which normalization occurred. If this line falls on the asymptote of every rarefaction curve, we are safe: even with more sequences per sample, we would not recover a significantly greater number of ASVs.

ASV <- as.data.frame(t(otu_table(location_phyloseq)))
sample_names <- rownames(ASV)

#Generate rarefaction curves
rarecurve <- rarecurve(ASV, step = 100, label = F)

Next, we will prepare the data for ggplot.

#For each rarefaction curve, transform rarecurve output to a dataframe.
rarecurve_table <- lapply(rarecurve, function(x){
  b <- as.data.frame(x)
  b <- data.frame(ASV = b[,1], raw.read = rownames(b))
  b$raw.read <- as.numeric(gsub("N", "",  b$raw.read))
  return(b)
})

#Produce a column with sample names and put everything in one data frame
names(rarecurve_table) <- sample_names
rarecurve_table <- purrr::map_dfr(rarecurve_table, function(x){
  z <- data.frame(x)
  return(z)
}, .id = "Sample")


#To color lines according to group, let's create a new column
#Coloring
color <- NULL
for (i in (1:length(rarecurve_table$Sample))) {
  color <- rbind(color,metadata.micro$location[which(metadata.micro$samples==rarecurve_table$Sample[i])])##Change "Location" to variable of interest
}

#Bind this column
rarecurve_table_color <- cbind(rarecurve_table, color)
colnames(rarecurve_table_color) <- c(colnames(rarecurve_table), "Location")
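As a side note, the coloring loop above can also be written in a vectorized way with match(), which looks up each sample's location directly:

```r
## Vectorized alternative to the coloring loop: match() returns, for
## each rarecurve sample, its row index in the metadata table
rarecurve_table_color <- rarecurve_table
rarecurve_table_color$Location <-
  metadata.micro$location[match(rarecurve_table$Sample, metadata.micro$samples)]
```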


## RARECURVE WITH GGPLOT ##
rareggplot<-ggplot(rarecurve_table_color, aes(x=raw.read, y=ASV, colour=Location, group=Sample)) + ### IMPORTANT TO GROUP BY SAMPLE
  theme_bw()+
  geom_point(aes(colour=Location), size=1)+
  geom_line(aes(colour=Location),size=0.5)+ #Change this for line thickness
  geom_vline(aes(xintercept = min(sample_sums(location_phyloseq))), lty=1, colour="black")+
  scale_fill_manual(values = c("location1"= "#33CC00", "location2"= "#ffd624", "location3"="#80CC88"))+
  scale_color_manual(values = c("location1"= "#33CC00","location2"= "#ffd624", "location3"="#80CC88"),
                     name="Location",
                     breaks=c("location1", "location2","location3"),
                     labels=c("Location 1","Location 2","Location 3"))+
  labs(title="Rarefaction curves", x="Number of sequences", y="Number of ASV")+
  guides(alpha="none")+
  theme(legend.key=element_blank(),
        legend.title.align = 0.85,
        legend.title = element_text(face="bold",size=14),
        axis.text = element_text(size=14),
        axis.title = element_text(size = 16),
        plot.title = element_text(hjust=0.5, face="bold", size=16),
        legend.text = element_text(size = 12))

rareggplot

As we can see, the rarefaction process took place in the asymptote of all samples.

To save plots on your machine, an easy and versatile option is ggsave from ggplot2. For publication you can save in .tiff, but other formats are also interesting. For example, .svg files can be opened in PowerPoint and easily modified, and if you are familiar with software such as Illustrator you can save in .eps.

#tiff
ggsave(filename = "rarefaction_curve.tiff", plot = rareggplot, device = "tiff", width = 18, height = 16, units = "cm", dpi = 800)

#svg
library(svglite)
ggsave(filename = "rarefaction_curve.svg", plot = rareggplot, device = "svg")
#> Saving 7 x 7 in image
#eps
postscript("curve_wito_RE.1.10.eps")
rareggplot
dev.off()  # Close the device
#> png 
#>   2

Let’s continue! Now, we will compute \(\alpha\) diversity indices. First, let’s rarefy the samples, as everything worked well.

rarefaction <- rarefy_even_depth(location_phyloseq, sample.size = min(sample_sums(location_phyloseq)), rngseed = TRUE)
#> `set.seed(TRUE)` was used to initialize repeatable random subsampling.
#> Please record this for your records so others can reproduce.
#> Try `set.seed(TRUE); .Random.seed` for the full vector
#> ...
#Get the seed and record it for reproducible analysis
.Random.seed[1]
#> [1] 10403
#Estimate alpha indices and save it
alpha_diversity_rarefied <- estimate_richness(rarefaction,measures=c("InvSimpson", "Shannon", "Observed"))

## CALCULATE EVENNESS
tASV <- as.data.frame(t(otu_table(rarefaction)))

Evenness_index <- as.data.frame(alpha_diversity_rarefied$Shannon/log(specnumber(tASV)))
Evenness <- cbind(Evenness_index, rownames(alpha_diversity_rarefied))
colnames(Evenness) <- c("Evenness", "Samples")

In this step, it is worth mentioning that we have not used richness estimators (such as Chao1 or ACE). These estimators rely on singletons; however, dada2 does not produce singleton ASVs. Singletons can only appear through the merging process, so they are not real singletons for a richness estimator to rely on. For more details, check dada2 issue #92.
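If you want to verify this on your own data, you can count how many ASVs are global singletons in the non-rarefied table (OTU, with ASVs as rows, was created earlier). Note that rarefaction itself can create apparent singletons by subsampling, so the check is best done on the raw counts:

```r
## Number of ASVs whose total abundance across all samples is 1;
## with dada2 output this should be very low or zero
sum(rowSums(OTU) == 1)
```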

#Generate final table
alpha_diversity_table <- cbind(alpha_diversity_rarefied, Evenness$Evenness, metadata.micro)
colnames(alpha_diversity_table)<- c("Observed", "Shannon", "InvSimpson", "Evenness",colnames(metadata.micro))

We now have the final table with the diversity indices we want to study, but it is also convenient to produce a table ready for publication.

#Select numeric columns and compute mean and SD
alpha_mean <- aggregate(alpha_diversity_table[,1:4], list(grouping=alpha_diversity_table$location), mean)%>% mutate_if(is.numeric, round, digits=2)
alpha_sd <- aggregate(alpha_diversity_table[,1:4], list(grouping=alpha_diversity_table$location), sd)%>% mutate_if(is.numeric, round, digits=2)

#Paste mean ± SD 
mean_sd <- NULL
for (i in 2:5){
  mean_sd <- cbind(mean_sd,paste0(alpha_mean[,i], " +/- ", alpha_sd[,i]))}


#generate table and give it columns names
alpha_pub_table <- cbind(alpha_mean$grouping, mean_sd)
colnames(alpha_pub_table) <- c("Location","Observed", "Shannon", "InvSimpson", "Evenness")


as.data.frame(alpha_pub_table)
#>    Location         Observed       Shannon     InvSimpson      Evenness
#> 1 location1  266.5 +/- 60.34 3.74 +/- 0.35  15.3 +/- 6.89 0.67 +/- 0.04
#> 2 location2 288.62 +/- 51.55 3.98 +/- 0.36 22.96 +/- 6.99  0.7 +/- 0.05
#> 3 location3 249.88 +/- 45.05 3.78 +/- 0.28  16.4 +/- 4.25 0.69 +/- 0.04

Statistical analysis

First, we will check the homogeneity of variances and the normality of the data, in order to choose an appropriate statistical method later on. Thanks to the levene.test.alpha and Shapiro functions from micro4all, we can run these tests for every index in just one step.

levene<- levene.test.alpha(alpha_diversity_table, 4, "location")
#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.

#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.

#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.

#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
levene
#>        Df   F value    Pr(>F) IndexColumn
#> group   2 0.4888944 0.6201127    Observed
#>        21        NA        NA    Observed
#> group1  2 0.2580918 0.7749391     Shannon
#>  1     21        NA        NA     Shannon
#> group2  2 0.4535917 0.6414220  InvSimpson
#>  2     21        NA        NA  InvSimpson
#> group3  2 0.2517571 0.7797466    Evenness
#>  3     21        NA        NA    Evenness
shapiro <- Shapiro(alpha_diversity_table, 4, "location")
shapiro
#>             p.value      Index
#> shapiro   0.5560323   Observed
#> shapiro.1 0.6978655    Shannon
#> shapiro.2 0.5233759 InvSimpson
#> shapiro.3 0.8245797   Evenness

None of them returns a significant value. That means that, according to these statistical tests, every diversity index in our data follows a normal distribution with no significant differences in variance (homoscedasticity). With this information, we can perform an ANOVA test. The BalancedAnova function from micro4all runs an ANOVA test for every index at once. It returns a list of two elements, the first being a data frame with the summary results for all indices.

balanced_anova<- BalancedAnova(alpha_diversity_table, numberOfIndexes = 4, formula = "location")

balanced_anova[[1]]
#>                  Df       Sum Sq      Mean Sq  F value     Pr(>F) IndexColum
#> data[, formula]   2 6.046583e+03 3.023292e+03 1.089031 0.35480765   Observed
#> Residuals        21 5.829875e+04 2.776131e+03       NA         NA   Observed
#> data[, formula]1  2 2.627533e-01 1.313767e-01 1.185111 0.32534387    Shannon
#> Residuals      1 21 2.327975e+00 1.108560e-01       NA         NA    Shannon
#> data[, formula]2  2 2.747742e+02 1.373871e+02 3.605388 0.04507594 InvSimpson
#> Residuals      2 21 8.002272e+02 3.810606e+01       NA         NA InvSimpson
#> data[, formula]3  2 3.980461e-03 1.990230e-03 1.173966 0.32862030   Evenness
#> Residuals      3 21 3.560142e-02 1.695306e-03       NA         NA   Evenness

Remember that, in case you have an unbalanced experiment, you can always use the function UnbalancedAnova in a similar way. Moreover, if you prefer to use a non-parametrical test, the function KruskalWallis is also available.

Now that we have an ANOVA result, imagine we want to test differences between every pair of groups, that is, L1 vs L2, L2 vs L3 and L1 vs L3. Let’s do that, again for every index, with Tukey.test from micro4all.

tukey <- Tukey.test(alpha_diversity_table, 4, "location", balanced=TRUE)

tukey
#>                               diff           lwr         upr      p adj IndexColumn
#> location2.location1    22.12500000  -44.27816558 88.52816558 0.68300608    Observed
#> location3.location1   -16.62500000  -83.02816558 49.77816558 0.80484680    Observed
#> location3.location2   -38.75000000 -105.15316558 27.65316558 0.32461516    Observed
#> location2.location1.1   0.23770146   -0.18191097  0.65731389 0.34520237     Shannon
#> location3.location1.1   0.03584613   -0.38376631  0.45545856 0.97479481     Shannon
#> location3.location2.1  -0.20185533   -0.62146777  0.21775710 0.45910399     Shannon
#> location2.location1.2   7.66241257   -0.11734048 15.44216562 0.05404339  InvSimpson
#> location3.location1.2   1.09519735   -6.68455570  8.87495040 0.93316996  InvSimpson
#> location3.location2.2  -6.56721522  -14.34696827  1.21253784 0.10827298  InvSimpson
#> location2.location1.3   0.03147496   -0.02041613  0.08336606 0.29814208    Evenness
#> location3.location1.3   0.01391228   -0.03797881  0.06580337 0.77992656    Evenness
#> location3.location2.3  -0.01756268   -0.06945377  0.03432841 0.67493215    Evenness

Another pairwise option is Dunn’s test, which is less powerful but makes no distributional assumptions about the data.

dunn <- dunnT(alpha_diversity_table, 4, "location")
#>   Kruskal-Wallis rank sum test
#> 
#> data: x and group
#> Kruskal-Wallis chi-squared = 1.3543, df = 2, p-value = 0.51
#> 
#> 
#>                            Comparison of x by group                            
#>                              (Benjamini-Hochberg)                              
#> Col Mean-|
#> Row Mean |   location   location
#> ---------+----------------------
#> location |  -0.848712
#>          |     0.2970
#>          |
#> location |   0.265222   1.113935
#>          |     0.3954     0.3980
#> 
#> alpha = 0.05
#> Reject Ho if p <= alpha/2
#>   Kruskal-Wallis rank sum test
#> 
#> data: x and group
#> Kruskal-Wallis chi-squared = 2.615, df = 2, p-value = 0.27
#> 
#> 
#>                            Comparison of x by group                            
#>                              (Benjamini-Hochberg)                              
#> Col Mean-|
#> Row Mean |   location   location
#> ---------+----------------------
#> location |  -1.520279
#>          |     0.1927
#>          |
#> location |  -0.282842   1.237436
#>          |     0.3886     0.1619
#> 
#> alpha = 0.05
#> Reject Ho if p <= alpha/2
#>   Kruskal-Wallis rank sum test
#> 
#> data: x and group
#> Kruskal-Wallis chi-squared = 5.915, df = 2, p-value = 0.05
#> 
#> 
#>                            Comparison of x by group                            
#>                              (Benjamini-Hochberg)                              
#> Col Mean-|
#> Row Mean |   location   location
#> ---------+----------------------
#> location |  -2.368807
#>          |     0.0268
#>          |
#> location |  -0.707106   1.661700
#>          |     0.2398     0.0724
#> 
#> alpha = 0.05
#> Reject Ho if p <= alpha/2
#>   Kruskal-Wallis rank sum test
#> 
#> data: x and group
#> Kruskal-Wallis chi-squared = 2.315, df = 2, p-value = 0.31
#> 
#> 
#>                            Comparison of x by group                            
#>                              (Benjamini-Hochberg)                              
#> Col Mean-|
#> Row Mean |   location   location
#> ---------+----------------------
#> location |  -1.520279
#>          |     0.1927
#>          |
#> location |  -0.707106   0.813172
#>          |     0.2398     0.3121
#> 
#> alpha = 0.05
#> Reject Ho if p <= alpha/2
dunn[[2]]
#>        chi2          Z           P P.adjusted           comparisons IndexColumn
#> 1  1.354339 -0.8487127 0.198020600 0.29703090 location1 - location2    Observed
#> 2  1.354339  0.2652227 0.395418952 0.39541895 location1 - location3    Observed
#> 3  1.354339  1.1139354 0.132653458 0.39796037 location2 - location3    Observed
#> 4  2.615000 -1.5202796 0.064220362 0.19266109 location1 - location2     Shannon
#> 5  2.615000 -0.2828427 0.388648705 0.38864871 location1 - location3     Shannon
#> 6  2.615000  1.2374369 0.107962469 0.16194370 location2 - location3     Shannon
#> 7  5.915000 -2.3688077 0.008922764 0.02676829 location1 - location2  InvSimpson
#> 8  5.915000 -0.7071068 0.239750061 0.23975006 location1 - location3  InvSimpson
#> 9  5.915000  1.6617009 0.048286377 0.07242957 location2 - location3  InvSimpson
#> 10 2.315000 -1.5202796 0.064220362 0.19266109 location1 - location2    Evenness
#> 11 2.315000 -0.7071068 0.239750061 0.23975006 location1 - location3    Evenness
#> 12 2.315000  0.8131728 0.208059496 0.31208924 location2 - location3    Evenness

When using the KruskalWallis function, the pairwise test to be applied should not be Tukey’s test, because it assumes normally distributed data. Instead, use Dunn’s test or the Mann–Whitney–Wilcoxon test. micro4all also implements looping functions for these tests:

wilcoxon <- wilcoxon.test(alpha_diversity_table, 4, "location", p.adjust.method = "BH")
#> Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot compute exact p-value with
#> ties
wilcoxon
#>              location1 location2      Index
#> location2   0.69288072        NA   Observed
#> location3   0.87847708 0.6928807   Observed
#> location2.1 0.41794872        NA    Shannon
#> location3.1 0.57373737 0.3911422    Shannon
#> location2.2 0.08438228        NA InvSimpson
#> location3.2 0.44180264 0.1244755 InvSimpson
#> location2.3 0.44180264        NA   Evenness
#> location3.3 0.44180264 0.4418026   Evenness

Plot \(\alpha\) diversity indices

alpha_plot_table <- tidyr::gather(alpha_diversity_table, key = "Measure", value = "Value", Observed, Shannon, InvSimpson, Evenness)

alpha_graphic <- ggplot(data = alpha_plot_table, aes(x = location, y = Value)) +
  facet_wrap(~Measure, scale = "free") +
  geom_boxplot()+
  theme(axis.text.x = element_text(size=13),
        legend.position="bottom", strip.text = element_text(size = 20), axis.text.y = element_text(size=15), axis.title.y=element_text(size=20)) +
  scale_x_discrete(limits=c("location1",
                            "location2","location3"), breaks=c("location1",
                            "location2","location3"),   ##With breaks and labels you can change the name displayed on labels
                   labels=c("Location 1",
                            "Location 2","Location 3")) +

  aes(fill=location)+ scale_fill_manual(values = c( "location1"= "#33CC00",
                                                          "location2"= "#ffd624", "location3"="#80CC88"), na.translate=FALSE) +
  theme(legend.key.size = unit(1, 'cm')) +
  ylab("Alpha Diversity Measure")

alpha_graphic

#use ggsave for saving ggplots
#ggsave("alpha_plot.tiff", plot = alpha_graphic,width = 17, height = 30, units = "cm", dpi = 800 )

Both the statistical analysis and the graphic show no significant differences between groups. Yet groups could still be distinct in terms of \(\beta\) diversity, i.e. when integrating abundance and taxonomic information.

\(\beta\) diversity analysis

Data preparation

For \(\beta\) diversity analysis, we first normalize the data using the edgeR8 package. This accounts for differences in library sizes between samples, as well as for compositionality.

library(edgeR); packageVersion("edgeR")
#> [1] '3.34.1'
edgeR <- DGEList(counts = OTU, samples = metadata.micro, genes = tax)
edgeR <- calcNormFactors(edgeR)

##EXTRACT NORMALIZED COUNTS
ASV_norm <- cpm(edgeR, normalized.lib.sizes=T, log=F)

##CREATE NORMALIZED PHYLOSEQ OBJECT
phy_OTU_norm<-otu_table(as.data.frame(ASV_norm,row.names=F), taxa_are_rows = T)
phy_taxonomy_norm<-tax_table(as.matrix(tax))
phy_metadata_norm<-sample_data(metadata.micro)

##Add taxa names
taxa_names(phy_OTU_norm)<- taxa_names(phy_taxonomy_norm)
#Check
identical(rownames(ASV_norm), rownames(tax))
#> [1] TRUE
##Merge
normalized_phyloseq<-phyloseq(phy_OTU_norm,phy_taxonomy_norm,phy_metadata_norm,tree_root)

Once we have a normalized phyloseq object, it is possible to analyze \(\beta\) diversity. A distance method gives us a measure of the dissimilarity between replicates; then, PERMANOVA tests whether differences between groups are significant or not. We will use the Permanova function from micro4all, which applies several distance measures at once.

permanova<- Permanova(normalized_phyloseq,distances = c("bray", "unifrac", "wunifrac"), formula = "location")


permanova[[1]]
#>            Df SumsOfSqs    MeanSqs  F.Model        R2 Pr(>F) distances
#> location    2 1.4040287 0.70201437 4.510496 0.3004895  0.001      bray
#> Residuals  21 3.2684435 0.15564017       NA 0.6995105     NA      bray
#> Total      23 4.6724722         NA       NA 1.0000000     NA      bray
#> location1   2 0.5426057 0.27130284 2.885410 0.2155638  0.005   unifrac
#> Residuals1 21 1.9745409 0.09402576       NA 0.7844362     NA   unifrac
#> Total1     23 2.5171466         NA       NA 1.0000000     NA   unifrac
#> location2   2 0.1382008 0.06910039 2.118345 0.1678782  0.030  wunifrac
#> Residuals2 21 0.6850197 0.03261999       NA 0.8321218     NA  wunifrac
#> Total2     23 0.8232205         NA       NA 1.0000000     NA  wunifrac

Bray-Curtis distance seems to explain the most variance (expressed as R2 in the table above). Nevertheless, another test is needed to confirm that there are no differences in dispersion between groups. If a betadisper test finds significant differences, the differences found with PERMANOVA could be due merely to distinct dispersions, and further graphic exploration would be needed.

Again, we can loop over distance measures thanks to micro4all:

betadisper<- Betadispersion(location_phyloseq,distances = c("bray", "unifrac", "wunifrac"), formula = "location")
betadisper
#> [[1]]
#>            Df      Sum Sq     Mean Sq         F N.Perm Pr(>F) distances
#> Groups      2 0.009769604 0.004884802 0.7280460    999  0.476      bray
#> Residuals  21 0.140898839 0.006709469        NA     NA     NA      bray
#> Groups1     2 0.009311883 0.004655941 0.8646057    999  0.461   unifrac
#> Residuals1 21 0.113085961 0.005385046        NA     NA     NA   unifrac
#> Groups2     2 0.003544450 0.001772225 0.6221726    999  0.533  wunifrac
#> Residuals2 21 0.059817359 0.002848446        NA     NA     NA  wunifrac
#> 
#> [[2]]
#> [[2]]$bray
#> 
#> Permutation test for homogeneity of multivariate dispersions
#> Permutation: free
#> Number of permutations: 999
#> 
#> Response: Distances
#>           Df  Sum Sq   Mean Sq     F N.Perm Pr(>F)
#> Groups     2 0.00977 0.0048848 0.728    999  0.476
#> Residuals 21 0.14090 0.0067095                    
#> 
#> [[2]]$unifrac
#> 
#> Permutation test for homogeneity of multivariate dispersions
#> Permutation: free
#> Number of permutations: 999
#> 
#> Response: Distances
#>           Df   Sum Sq   Mean Sq      F N.Perm Pr(>F)
#> Groups     2 0.009312 0.0046559 0.8646    999  0.461
#> Residuals 21 0.113086 0.0053850                     
#> 
#> [[2]]$wunifrac
#> 
#> Permutation test for homogeneity of multivariate dispersions
#> Permutation: free
#> Number of permutations: 999
#> 
#> Response: Distances
#>           Df   Sum Sq   Mean Sq      F N.Perm Pr(>F)
#> Groups     2 0.003544 0.0017722 0.6222    999  0.533
#> Residuals 21 0.059817 0.0028485

In this example, dispersion was not significantly different between the three groups.

But which groups are distinct? We can run a pairwiseAdonis test to check for differences. Again, micro4all gives us the option to loop over distances:

pairwise<- PairwiseAdonisFun(normalized_phyloseq,distances = c("bray", "unifrac", "wunifrac"), formula = "location", pval=0.05)


pairwise
#>                    pairs Df SumsOfSqs  F.Model        R2 p.value p.adjusted sig distances
#> 1 location1 vs location2  1 0.3391022 2.076899 0.1291853   0.004      0.004   *      bray
#> 2 location1 vs location3  1 0.9248550 6.081317 0.3028346   0.001      0.003   *      bray
#> 3 location2 vs location3  1 0.8420859 5.555908 0.2841038   0.002      0.003   *      bray
#> 4 location1 vs location2  1 0.2674323 2.821635 0.1677384   0.042      0.042   .   unifrac
#> 5 location1 vs location3  1 0.2743299 2.772519 0.1653013   0.016      0.042   .   unifrac
#> 6 location2 vs location3  1 0.2721463 3.080251 0.1803399   0.035      0.042   .   unifrac

It seems that, according to the statistical analysis used, and with Bray-Curtis distances, all groups are significantly different from each other. For visualization, we can use PCoA and NMDS with every distance measure. Here is a loop that produces every possible graphic and assigns each one to an element of a list.

var_list <-  c("bray", "unifrac", "wunifrac")
plot_type <-  c("PCoA", "NMDS")
combination_plot <- purrr::cross2(plot_type,var_list )


# Make plots.
plot_list = list()

for (i in 1:length(combination_plot)){
      pcoa <- ordinate(normalized_phyloseq,combination_plot[[i]][[1]],combination_plot[[i]][[2]])
      plot = plot_ordination(normalized_phyloseq, pcoa, type="samples", color="location")+
        geom_text(aes(label=location), hjust=0, vjust=0, show.legend=FALSE)+
        geom_point(size=4)
      if (combination_plot[[i]][[1]]=="NMDS"){
        plot= plot + xlab(paste("Axis 1"))+
        ylab(paste("Axis 2"))+
          theme(legend.position="bottom", plot.title = element_text(face="bold", hjust = 0.5), legend.title = element_blank(), legend.key = element_blank(),
                panel.border = element_rect(linetype = "solid", fill = NA),
                panel.background = element_rect(fill="white", colour="white"),
                panel.grid.major = element_line(colour="aliceblue", size=0.4),
                panel.grid.minor= element_line(colour = "aliceblue", size =0.4))+

          scale_color_manual(values=c("location1"= "#33CC00",
                                                          "location2"= "#ffd624", "location3"="#80CC88"))+
          guides(color=guide_legend(nrow=2,byrow=TRUE))+
          guides(shape=guide_legend(nrow=2,byrow=TRUE))+
          theme(plot.title = element_text(face="bold", hjust = 0.5))+
          ggtitle(paste("Bacterial Community", combination_plot[[i]][[1]], "on", combination_plot[[i]][[2]], "distances", round(pcoa$stress, digits = 3)))

        plot_list[[i]] = plot

      }
      else {
        plot= plot +   xlab(paste("PCo 1", paste("(",round(pcoa$values[1,2]*100,1),"%",")",sep=""),sep=" "))+
          ylab(paste("PCo 2", paste("(",round(pcoa$values[2,2]*100,1),"%",")",sep=""),sep=" "))+

          theme(legend.position="bottom", plot.title = element_text(face="bold", hjust = 0.5), legend.title = element_blank(), legend.key = element_blank(),
                panel.border = element_rect(linetype = "solid", fill = NA),
                panel.background = element_rect(fill="white", colour="white"),
                panel.grid.major = element_line(colour="aliceblue", size=0.4),
                panel.grid.minor= element_line(colour = "aliceblue", size =0.4))+

          scale_color_manual(values=c("location1"= "#33CC00",
                                                          "location2"= "#ffd624", "location3"="#80CC88"))+
          guides(color=guide_legend(nrow=2,byrow=TRUE))+
          guides(shape=guide_legend(nrow=2,byrow=TRUE))+
          theme(plot.title = element_text(face="bold", hjust = 0.5))+
          ggtitle(paste("Bacterial Community", combination_plot[[i]][[1]], "on", combination_plot[[i]][[2]], "distances"))

        plot_list[[i]] = plot

      }

       }



plot_list[[1]]

Going deeper into the analysis, we can get a bar plot of the variance explained by each axis. However, we should be careful here because we are using non-Euclidean distances, which is why negative eigenvalues (imaginary dimensions) can appear. If we want to visualize corrected eigenvalues, select the element Corr_eig of the pcoa object.

PCOA_Wunifrac  <-  ordinate(normalized_phyloseq, "PCoA", "wunifrac", formula="location")
length(PCOA_Wunifrac$values$Relative_eig)
#> [1] 24
barplot(100*PCOA_Wunifrac$values$Relative_eig, ylim=c(0,10+100*max(PCOA_Wunifrac$values$Relative_eig)),
        xlab=c("Axes"),ylab = c("Eigenvalue (%)"),axisnames = T,
        axis.lty = 1)

And with the next line, we can get the percentage of variance explained by the number of axes we choose.

sum(PCOA_Wunifrac$values$Relative_eig[1:10])
#> [1] 0.9826669
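The relative eigenvalues behind these numbers are simply each eigenvalue divided by their total. A minimal base-R sketch, using a made-up eigenvalue vector (not the real wunifrac ordination):

```r
# Hypothetical PCoA eigenvalues; negative values can appear with non-Euclidean distances
eig <- c(12.4, 6.1, 2.8, 0.9, -0.2)

# Relative contribution of each axis, analogous to $values$Relative_eig
rel_eig <- eig / sum(eig)

# Percentage of variance captured by the first two axes
round(100 * sum(rel_eig[1:2]), 1)
```

Summing the first k relative eigenvalues gives the fraction of variance covered by the first k axes, exactly as done above with `Relative_eig[1:10]`.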

Produce tables with relative abundance

For further analysis, or to represent the data, it is useful to have tables of relative abundance at different levels. First, we usually produce a table of relative abundance by sample.

##CALCULATE RELATIVE ABUNDANCE
sample_relabun <- transform_sample_counts(location_phyloseq, function(x){x/sum(x)}*100)
##CONSTRUCT THE TABLE
OTU_sample <-  as.data.frame((otu_table(sample_relabun)))
taxonomy_sample <- as.data.frame(tax_table(sample_relabun))
identical(rownames(OTU_sample),rownames(taxonomy_sample))
#> [1] TRUE
sample_relabun_table  <- cbind(taxonomy_sample, OTU_sample)
#SORT IT
sample_relabun_table <- sample_relabun_table[sort(sample_relabun_table$ASV_names, decreasing = FALSE),]

#Check relative abundance sums 100
colSums(sample_relabun_table[,8:ncol(sample_relabun_table)])
#> L1_1 L1_2 L1_3 L1_4 L1_5 L1_6 L1_7 L1_8 L2_1 L2_2 L2_3 L2_4 L2_5 L2_6 L2_7 L2_8 L3_1 L3_2 L3_3 
#>  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100 
#> L3_4 L3_5 L3_6 L3_7 L3_8 
#>  100  100  100  100  100
head(sample_relabun_table)
#>          Kingdom         Phylum          Class             Order                Family
#> ASV0002 Bacteria Actinobacteria Actinobacteria  Streptomycetales     Streptomycetaceae
#> ASV0003 Bacteria   unclassified   unclassified      unclassified          unclassified
#> ASV0006 Bacteria Actinobacteria Actinobacteria Pseudonocardiales    Pseudonocardiaceae
#> ASV0007 Bacteria Actinobacteria Actinobacteria Pseudonocardiales    Pseudonocardiaceae
#> ASV0008 Bacteria Actinobacteria Actinobacteria Pseudonocardiales    Pseudonocardiaceae
#> ASV0009 Bacteria Actinobacteria Actinobacteria     Micrococcales Promicromonosporaceae
#>                     Genus ASV_names       L1_1      L1_2      L1_3       L1_4       L1_5
#> ASV0002      Streptomyces   ASV0002 13.8889992  4.937848 10.893415 22.1990687  9.5526244
#> ASV0003      unclassified   ASV0003  3.8330224 27.683538 26.364099 15.4691786 10.6119119
#> ASV0006      unclassified   ASV0006  2.5943864  6.496807  7.246131 10.9847116 19.6124558
#> ASV0007    Pseudonocardia   ASV0007  1.3855254  3.001168  2.136059  0.9440983  0.4732119
#> ASV0008      unclassified   ASV0008 11.3065227  2.218254  2.586520  1.1120798  7.1675279
#> ASV0009 Promicromonospora   ASV0009  0.1905594  1.504018  2.310431  1.8605541  0.1060647
#>               L1_6       L1_7     L1_8       L2_1       L2_2       L2_3      L2_4       L2_5
#> ASV0002  6.9848271 20.4381884 5.620269  7.6930658  7.2420046 11.2892520 3.0031259 14.4508131
#> ASV0003 24.8232220 10.3801344 5.665003 10.5471676  3.6577360 22.6933878 7.1731339  1.5160285
#> ASV0006  6.2825092  5.3759593 1.043176  6.6385461  4.5674805  7.4010722 7.1764071  1.9043707
#> ASV0007  1.8744457  2.3142073 0.925081  2.0439942  0.3016851  0.5514424 1.8673388  0.6254551
#> ASV0008  4.2882140  3.0677164 6.514932 14.3000749 12.2581049  7.5312739 1.3452695  4.4099251
#> ASV0009  0.4674129  0.3419619 3.689588  0.2030197  7.7359552  0.1582844 0.5007937  7.1880659
#>              L2_6     L2_7     L2_8       L3_1      L3_2        L3_3       L3_4      L3_5
#> ASV0002 12.829278 8.654647 9.041784 13.6220214  5.024374 17.70502153  7.8901345  8.995795
#> ASV0003  0.921740 9.530792 9.079588  8.5899753 11.863936 10.48276173 11.3430493  3.270292
#> ASV0006  1.774789 6.913218 5.461540  0.3648316  1.406825  0.62233647  0.7892377  1.592300
#> ASV0007  1.442515 4.163499 3.624719 14.8167625 16.529120  5.98209655  9.9327354  4.862592
#> ASV0008  8.960207 1.828319 6.101981  0.0000000  0.000000  0.00000000  0.0000000  0.000000
#> ASV0009  2.551159 1.178451 1.461006  7.1783073  1.233644  0.03156779  1.8609865 21.526934
#>                L3_6       L3_7        L3_8
#> ASV0002 19.96646697 19.7996690  5.78009910
#> ASV0003 14.42738258  9.9207386 21.88830856
#> ASV0006  2.58040771  0.1428447  0.82764162
#> ASV0007 16.93827060  3.8899051 13.41368689
#> ASV0008  0.00000000  0.0000000  0.00000000
#> ASV0009  0.01840227  0.5853149  0.03214142
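Under the hood, transform_sample_counts() just applies the scaling function to each sample's counts. The same operation on a toy count matrix in base R (all numbers made up):

```r
# Toy ASV count matrix: rows are ASVs, columns are samples
counts <- matrix(c(10, 30, 60,
                   5, 45, 50),
                 nrow = 3,
                 dimnames = list(paste0("ASV", 1:3), c("S1", "S2")))

# Per-sample relative abundance (%), equivalent to function(x){x/sum(x)}*100
rel <- apply(counts, 2, function(x) x / sum(x) * 100)

colSums(rel)  # each sample sums to 100
```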

Of course, we can do the same at genus or phylum level.

#Glom phyloseq at taxonomical level

location_phyloseq_genus <- tax_glom(location_phyloseq, taxrank = "Genus")
sample_relabun_genus <- transform_sample_counts(location_phyloseq_genus, function(x){x/sum(x)}*100)
##CONSTRUCT THE TABLE
OTU_sample_genus <-  as.data.frame((otu_table(sample_relabun_genus)))
taxonomy_sample_genus  <- as.data.frame(tax_table(sample_relabun_genus)[,-7])
identical(rownames(OTU_sample_genus),rownames(taxonomy_sample_genus))
#> [1] TRUE
sample_table_genus  <- cbind(taxonomy_sample_genus, OTU_sample_genus)

Furthermore, we would like to group all genus-level unclassified taxa together.

##Agglomerate unclassified genera in our table
sample_table_genus_unclass <- sample_table_genus %>% subset(Genus=="unclassified", select=c(7:ncol(sample_table_genus))) %>%
  colSums() %>%  t() %>% as.data.frame() %>% cbind(Kingdom="unclassified", Phylum="unclassified", Class="unclassified",
                                                   Order="unclassified", Family="unclassified", Genus="unclassified",  .)

sample_table_genus_final <- rbind(subset(sample_table_genus, Genus!="unclassified"), sample_table_genus_unclass)

#Check relative abundance sums 100
colSums(sample_table_genus_final[,7:ncol(sample_table_genus_final)])
#> L1_1 L1_2 L1_3 L1_4 L1_5 L1_6 L1_7 L1_8 L2_1 L2_2 L2_3 L2_4 L2_5 L2_6 L2_7 L2_8 L3_1 L3_2 L3_3 
#>  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100  100 
#> L3_4 L3_5 L3_6 L3_7 L3_8 
#>  100  100  100  100  100
head(sample_table_genus_final)
#>          Kingdom         Phylum          Class             Order             Family
#> ASV0800 Bacteria Actinobacteria Actinobacteria  Streptomycetales  Streptomycetaceae
#> ASV0007 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0010 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0303 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0054 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0020 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#>                     Genus      L1_1     L1_2      L1_3       L1_4        L1_5       L1_6
#> ASV0800       Yinghuangia 0.0000000 0.000000 0.1017171 0.00000000 0.000000000 0.01438194
#> ASV0007    Pseudonocardia 6.9494621 4.721516 8.2560364 4.98628506 8.218656513 6.60370575
#> ASV0010     Amycolatopsis 7.5211402 2.812307 5.9238091 5.62844203 5.384824585 8.73462931
#> ASV0303     Saccharothrix 0.1131446 0.000000 0.1477320 0.00000000 0.000000000 0.10546753
#> ASV0054 Saccharopolyspora 0.0317599 4.937848 0.0000000 0.02551617 0.009518629 0.17018625
#> ASV0020   Actinophytocola 0.3334789 1.637937 1.0292800 3.37238725 1.419635572 0.42666411
#>               L1_7       L1_8      L2_1        L2_2        L2_3       L2_4      L2_5       L2_6
#> ASV0800 0.02584596 0.00000000  0.000000  0.00000000  0.00000000 0.02782187 0.0000000  0.0000000
#> ASV0007 5.39981709 7.61894537  9.589230  4.20639635  5.54505999 3.75595306 4.0253169  6.6039394
#> ASV0010 5.12545230 7.56347630 11.692356 11.93922531 11.59050294 7.65756182 7.7873826 11.8180802
#> ASV0303 0.12127719 0.04294380  0.000000  0.02188389  0.00000000 0.28803823 0.0000000  0.0000000
#> ASV0054 5.67815818 0.06620502  0.000000  0.00000000  0.03829461 0.11456066 0.0000000  0.2460103
#> ASV0020 1.87880234 0.24871616  3.201009  1.70850658  1.77687005 1.76750732 0.9521854  0.3147015
#>               L2_7       L2_8        L3_1       L3_2     L3_3        L3_4       L3_5
#> ASV0800 0.00000000 0.00000000  0.05094495  0.0000000  0.00000  0.00000000 0.08370035
#> ASV0007 9.02031063 8.99286175 22.29252260 23.2938510 11.42529 16.30269058 9.04163096
#> ASV0010 3.93903190 7.29836109  2.71980279  9.5014111  4.71713  5.62780269 4.43014010
#> ASV0303 0.05249629 0.02446130  0.00000000  0.0000000  0.00000  0.14798206 0.00000000
#> ASV0054 0.24256906 0.07116013  0.63434675  0.2501497  0.00000  0.05381166 0.00000000
#> ASV0020 1.93693204 1.49213903  0.14626130  0.9599761  0.00000  0.23542601 0.03387871
#>                L3_6       L3_7       L3_8
#> ASV0800  0.00000000 0.00000000  0.0000000
#> ASV0007 25.36855665 6.38271928 22.1882952
#> ASV0010  5.16899421 2.50675028 13.2610151
#> ASV0303  0.02658106 0.05226026  0.0000000
#> ASV0054  0.00000000 1.48244926  0.2062408
#> ASV0020  0.47845912 0.00000000  2.3650730
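The collapse of all "unclassified" rows into one can also be sketched with base R's rowsum(), which sums rows sharing a label (toy numbers below):

```r
# Toy genus-level relative abundances (%), two samples
abund <- data.frame(Genus = c("Streptomyces", "unclassified", "unclassified"),
                    S1 = c(40, 35, 25),
                    S2 = c(60, 10, 30))

# Sum the abundance rows that share the same Genus label
collapsed <- rowsum(abund[, c("S1", "S2")], group = abund$Genus)
collapsed
```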

Next, we can group it by the variable of interest, in our case, location (at genus level)

otu_genus <- sample_table_genus_final[,7:ncol(sample_table_genus_final)] %>% t() %>% as.data.frame()
#Save taxonomy data
tax_genus <- sample_table_genus_final[,1:6]

#Calculate OTU mean abundance based on grouping factor (e.g., location)
location_mean <- aggregate(otu_genus, by=list(metadata.micro$location), FUN=mean)%>% column_to_rownames("Group.1") %>% t()
#Calculate OTU SD  based on grouping factor (e.g., location) and change colnames
location_SD <- aggregate(otu_genus, by=list(metadata.micro$location), FUN=sd)%>% column_to_rownames("Group.1")  %>% t()  %>%
  as.data.frame() %>% rename_with(.fn= ~paste0(colnames(location_mean), "SD"))
#Merge mean abundance, SD and taxonomy.
genus_location_table <- merge(tax_genus, location_mean, by=0) %>%column_to_rownames("Row.names") %>%
  merge(location_SD, by=0) %>% column_to_rownames("Row.names")

#Check abundances sum 100
colSums(genus_location_table[,7:ncol(genus_location_table)])
#>   location1   location2   location3 location1SD location2SD location3SD 
#>   100.00000   100.00000   100.00000    51.76728    59.31652    66.66386
head(genus_location_table)
#>              Kingdom         Phylum          Class             Order                Family
#> 1       unclassified   unclassified   unclassified      unclassified          unclassified
#> ASV0002     Bacteria Actinobacteria Actinobacteria  Streptomycetales     Streptomycetaceae
#> ASV0007     Bacteria Actinobacteria Actinobacteria Pseudonocardiales    Pseudonocardiaceae
#> ASV0009     Bacteria Actinobacteria Actinobacteria     Micrococcales Promicromonosporaceae
#> ASV0010     Bacteria Actinobacteria Actinobacteria Pseudonocardiales    Pseudonocardiaceae
#> ASV0020     Bacteria Actinobacteria Actinobacteria Pseudonocardiales    Pseudonocardiaceae
#>                     Genus location1 location2  location3 location1SD location2SD location3SD
#> 1            unclassified 42.679895 39.071388 26.6895552    8.899015  10.4270430   8.3959878
#> ASV0002      Streptomyces 17.979426 13.409487 17.3227189    8.244194   5.5368089   9.1979191
#> ASV0007    Pseudonocardia  6.594303  6.467384 17.0369440    1.419046   2.4476910   7.2917718
#> ASV0009 Promicromonospora  1.440138  2.642542  4.0584122    1.461164   3.0727149   7.4498456
#> ASV0010     Amycolatopsis  6.086760  9.215313  5.9916308    1.837111   2.9749138   3.6426730
#> ASV0020   Actinophytocola  1.293363  1.643731  0.5273843    1.044258   0.8294456   0.8103738
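The aggregate() calls above reduce, for each taxon, the per-sample abundances to a per-location mean and SD. A self-contained sketch with made-up values:

```r
# One taxon, two samples per location (made-up relative abundances)
otu <- data.frame(Streptomyces = c(18, 20, 12, 14))
location <- c("loc1", "loc1", "loc2", "loc2")

# Mean and standard deviation per grouping factor, as in the workflow above
means <- aggregate(otu, by = list(location), FUN = mean)
sds   <- aggregate(otu, by = list(location), FUN = sd)
means
```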

We can also aggregate by location at ASV level

#Save OTU data
otu_location_ASV <- sample_relabun_table[,8:ncol(sample_relabun_table)] %>% t() %>% as.data.frame()
#Save taxonomy data
tax_location_ASV <- sample_relabun_table[,1:7]

#Calculate OTU mean abundance based on grouping factor (e.g., location)
location_ASV_mean <- aggregate(otu_location_ASV, by=list(metadata.micro$location), FUN=mean)%>% column_to_rownames("Group.1") %>% t()
#Calculate OTU SD based on grouping factor (e.g., location) and change colnames
location_ASV_SD <- aggregate(otu_location_ASV, by=list(metadata.micro$location), FUN=sd)%>% column_to_rownames("Group.1")  %>% t()  %>%
  as.data.frame() %>% rename_with(.fn= ~paste0(colnames(location_ASV_mean), "SD"))

#Merge mean abundance, SD and taxonomy.
ASV_location_table <- merge(tax_location_ASV, location_ASV_mean, by=0) %>%column_to_rownames("Row.names") %>%
  merge(location_ASV_SD, by=0) %>% column_to_rownames("Row.names")

#Check abundances sum 100
colSums(ASV_location_table[,8:ncol(ASV_location_table)])
#>   location1   location2   location3 location1SD location2SD location3SD 
#>   100.00000   100.00000   100.00000   111.04717   115.16294    98.63739

Taxonomical profile graphics

Usually, taxonomical profiles are interesting graphics to present when publishing data. They give a visual idea of the main taxonomical groups present in our communities.

First, we prepare a table in which genera representing less than 1% of the sequences are labelled as ‘other genera’. Moreover, we reorder this table so that the graphic stacks genera, unclassified and other genera (from bottom to top).

taxonomical_genus_location <- genus_location_table %>% select(6:9) %>% #select columns with genera and abundance
  pivot_longer(!Genus, names_to="Samples", values_to="Abundance") %>%
  mutate(Genus = case_when(Abundance<= 1.0 & Genus!="unclassified" ~ "Other genera (<1%)", TRUE ~ as.character(Genus)))

#Reorder table (sorted genera first, then unclassified and other genera)
taxonomical_genus_location_main <- taxonomical_genus_location %>% filter(Abundance > 1 & Genus!="unclassified") %>%
  arrange(desc(Abundance))

taxonomical_genus_location <- rbind(taxonomical_genus_location_main, filter(taxonomical_genus_location, Genus=="unclassified"|Genus=="Other genera (<1%)"))


head(taxonomical_genus_location)
#> # A tibble: 6 × 3
#>   Genus          Samples   Abundance
#>   <chr>          <chr>         <dbl>
#> 1 Streptomyces   location1     18.0 
#> 2 Streptomyces   location3     17.3 
#> 3 Pseudonocardia location3     17.0 
#> 4 Streptomyces   location2     13.4 
#> 5 Amycolatopsis  location2      9.22
#> 6 Pseudonocardia location1      6.59
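The case_when() relabelling can be reproduced with a plain ifelse(): any genus at or below 1% becomes "Other genera (<1%)", while "unclassified" keeps its own label (toy vectors below):

```r
genus <- c("Streptomyces", "Rarebug", "unclassified")
abundance <- c(18, 0.4, 26)

# Low-abundance genera are pooled; "unclassified" is left untouched
genus_plot <- ifelse(abundance <= 1 & genus != "unclassified",
                     "Other genera (<1%)", genus)
genus_plot
```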

Afterwards, we create an expression to pass to ggplot so that genus names are written in italics (while labels such as ‘unclassified’ stay in regular type).

library(gdata)
#Get labels for location
location_label <- levels(as.factor(unique(taxonomical_genus_location$Samples)))

#Get unique names for genera
unique_genera <- unique(as.vector(taxonomical_genus_location$Genus))

#REORDER FACTORS LEVELS IN DATA FRAME
taxonomical_genus_location$Genus=reorder.factor(taxonomical_genus_location$Genus,new.order=rev(unique_genera))

sorted_labels_genus<- as.data.frame(unique_genera)

##CREATE AN EXPRESSION FOR GGPLOT WITH ITALICS WHEN NEEDED.
sorted_labels_ggplot <- sapply(sorted_labels_genus$unique_genera,
                                           function(x) if (x == "Other phyla (<1%)"|x == "unclassified"|x == "Other genera (<1%)")
                                           {parse(text=paste0("'", as.character(x), "'"))} else {parse(text = paste0("italic('",as.character(x), "')"))})
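The key idea is that parse() turns a character string into an unevaluated plotmath expression, which ggplot then renders: italic('x') in italics, a quoted string as plain text. A minimal standalone version:

```r
labels <- c("Streptomyces", "unclassified")

# Build one plotmath expression per label; only real genus names get italics
expr_labels <- lapply(labels, function(x)
  if (x == "unclassified") parse(text = paste0("'", x, "'"))
  else parse(text = paste0("italic('", x, "')")))

sapply(expr_labels, is.expression)
```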

When you have to color a large number of objects, the randomcoloR package can be very useful. You can combine it with the scales package to display the randomly generated colors together with their hex codes.

library(randomcoloR);packageVersion("randomcoloR")
#> [1] '1.1.0.1'
library(scales);packageVersion("scales")
#> [1] '1.1.1'
colors_genus <-  randomColor(count=length(unique_genera), hue=c("random"))
show_col(colors_genus)

If you want to change some of these colors, it can be easily done as follows:

colors_genus[5] <-  "#448800"

In this tutorial, we will use the following palette, as it has been previously optimized to contain highly distinctive colors:

c21 <- c(
  "dodgerblue2", "#E31A1C", # red
  "green4",
  "#6A3D9A", # purple
  "#FF7F00", # orange
  "black", "gold1",
  "skyblue2", "#FB9A99", # lt pink
  "#CAB2D6", # lt purple
  "#FDBF6F", # lt orange
  "gray70", "khaki2",
  "maroon", "steelblue4",
  "darkturquoise", "green1", "yellow4", "yellow3",
  "darkorange4", "brown"
)

At last, let’s plot taxonomical profile at genus level!

ggplot_title <- "Bacterial genera by location"

ggGenera=ggplot(taxonomical_genus_location, aes(x=Samples, y=Abundance, fill=Genus, order=Genus)) + geom_bar(stat="identity", position="stack")+
  scale_fill_manual(values=c21,
                    breaks=unique_genera,
                    labels=sorted_labels_ggplot)+

  labs(y="Mean relative abundance (%)", x=NULL, fill="Genus",title=ggplot_title)+
  guides(fill = guide_legend(reverse = TRUE))+
  scale_x_discrete(limits=location_label,
                   labels=location_label)+
  scale_y_continuous(expand=c(0.01,0.01),
                     breaks=c(0,10,20,30,40,50,60,70,80,90,100),
                     labels=c("0","10", "20","30","40","50","60","70","80","90","100"),
                     limits = c(NA, 100))+
  theme_bw()+
  theme(panel.border = element_rect(colour="black"), axis.title.x=element_blank(),
        plot.title = element_text(face="bold", hjust = 0.5, size=16), axis.text = element_text(size = 14),
        axis.text.x = element_text(face="bold", size=10, angle = 45, vjust=1, hjust = 1), axis.title.y = element_text(size = 16),
        legend.key.size = unit(0.9, "cm"), legend.text = element_text(size = 12),
        legend.title = element_text(size=14, face="bold"), legend.title.align=0.5)


ggGenera

ANCOM-BC analysis

There are several methods to perform differential abundance analysis on microbial community data. Among them, we chose ANCOM-BC, because it is specifically designed for this kind of data and accounts for its compositional nature. Unfortunately, when there are multiple groups to compare, the ancombc function only compares the first group against all the others. This is a big drawback, as it requires the user to modify the input phyloseq object for every other combination of groups. To solve this problem, micro4all implements the function ancomloop. Apart from looping over groups, it also returns a table with ANCOM-corrected logarithmic abundances. Let’s see how it works!

library(ANCOMBC);packageVersion("ANCOMBC")
#> [1] '1.2.2'
library(microbiome);packageVersion("microbiome")
#> [1] '1.14.0'
library(car);packageVersion("car")
#> [1] '3.0.11'
library(ggplot2);packageVersion("ggplot2")
#> [1] '3.3.5'
library(gridExtra);packageVersion("gridExtra")
#> [1] '2.3'
#When performing ancombc at genus level, we can filter unclassified genera out with the out.unclassified and tax.level arguments
ANCOM_location_genus <- ancomloop(input_object_phyloseq = location_phyloseq_genus, grouping = "location", ancom.options = list(global=FALSE, struc_zero=TRUE), out.unclassified = TRUE, tax.level="Genus")
#> [1] "location1" "location2" "location3"
#> [1] "location2" "location1" "location3"
#> [1] "location3" "location1" "location2"
head(ANCOM_location_genus[["location1"]])
#>          Kingdom         Phylum               Class             Order                Family
#> ASV0002 Bacteria Actinobacteria      Actinobacteria  Streptomycetales     Streptomycetaceae
#> ASV0007 Bacteria Actinobacteria      Actinobacteria Pseudonocardiales    Pseudonocardiaceae
#> ASV0009 Bacteria Actinobacteria      Actinobacteria     Micrococcales Promicromonosporaceae
#> ASV0010 Bacteria Actinobacteria      Actinobacteria Pseudonocardiales    Pseudonocardiaceae
#> ASV0020 Bacteria Actinobacteria      Actinobacteria Pseudonocardiales    Pseudonocardiaceae
#> ASV0021 Bacteria Proteobacteria Alphaproteobacteria       Rhizobiales     Bradyrhizobiaceae
#>                     Genus ASV_names location1MeanLogAbun location2MeanLogAbun
#> ASV0002      Streptomyces      <NA>             8.957076             8.759109
#> ASV0007    Pseudonocardia      <NA>             8.021944             8.045645
#> ASV0009 Promicromonospora      <NA>             5.952887             6.492680
#> ASV0010     Amycolatopsis      <NA>             7.915451             8.407511
#> ASV0020   Actinophytocola      <NA>             6.082074             6.583236
#> ASV0021    Bradyrhizobium      <NA>             5.960075             6.315572
#>         location3MeanLogAbun location1SD location2SD location3SD location1vslocation2_beta
#> ASV0002             8.484159   0.6493031   0.5991199   0.6372969               -0.19796656
#> ASV0007             8.498512   0.3696452   0.4329206   0.7333242                0.02370005
#> ASV0009             5.103344   1.2952050   1.3736179   2.4417865                0.53979248
#> ASV0010             7.411011   0.3356190   0.5825709   0.8391443                0.49206023
#> ASV0020             3.385740   1.2996042   0.8690411   2.8593764                0.50116212
#> ASV0021             6.344847   0.4432286   0.3798151   0.3854025                0.35549642
#>         location1vslocation2_SE location1vslocation2_W location1vslocation2_pvalue
#> ASV0002               0.2921840             -0.6775408                  0.49806287
#> ASV0007               0.1882653              0.1258865                  0.89982180
#> ASV0009               0.6243831              0.8645212                  0.38730162
#> ASV0010               0.2223526              2.2129730                  0.02689951
#> ASV0020               0.5170442              0.9692829                  0.33240408
#> ASV0021               0.1930422              1.8415480                  0.06554129
#>         location1vslocation2_padjusted location1vslocation2_diff location1vslocation3_beta
#> ASV0002                              1                     FALSE                -0.4729166
#> ASV0007                              1                     FALSE                 0.4765678
#> ASV0009                              1                     FALSE                -0.8495434
#> ASV0010                              1                     FALSE                -0.5044403
#> ASV0020                              1                     FALSE                -2.6963336
#> ASV0021                              1                     FALSE                 0.3847719
#>         location1vslocation3_SE location1vslocation3_W location1vslocation3_pvalue
#> ASV0002               0.3008891             -1.5717306                 0.116013048
#> ASV0007               0.2715929              1.7547136                 0.079308315
#> ASV0009               0.9141179             -0.9293586                 0.352703260
#> ASV0010               0.2988944             -1.6876875                 0.091471235
#> ASV0020               1.0387420             -2.5957684                 0.009437965
#> ASV0021               0.1942496              1.9808117                 0.047612394
#>         location1vslocation3_padjusted location1vslocation3_diff
#> ASV0002                              1                     FALSE
#> ASV0007                              1                     FALSE
#> ASV0009                              1                     FALSE
#> ASV0010                              1                     FALSE
#> ASV0020                              1                     FALSE
#> ASV0021                              1                     FALSE

We can filter these results to keep only the significant ones for the comparison we want to analyze.

ANCOM_location1_genus <- ANCOM_location_genus[["location1"]]
ANCOM_loc1vsloc2_genus_sig <- ANCOM_location1_genus[,1:19] %>% filter(location1vslocation2_diff == TRUE)

ANCOM-BC graphic

Graphical representation of differential abundance results can be achieved with two approaches: log fold change of the corrected abundances, or relative abundance without correction. Here, we will show how to produce both graphics. For publication purposes, we prefer to combine the two into a single figure.

We will start with log fold change. First, we will bind the phylum name to the genus name (the same can be done when representing ASVs).

#Bind phylum name to genus name
graphic_loc1vsloc2_genus <- ANCOM_loc1vsloc2_genus_sig
graphic_loc1vsloc2_genus$Classification <- paste(graphic_loc1vsloc2_genus$Phylum, graphic_loc1vsloc2_genus$Genus, sep="|")

Next, we create a data frame with the classification and the log fold change, and filter out those genera whose log fold change is 0 or below a chosen threshold. Remember, the log fold change corresponds to the beta parameter of the res element returned by the ancombc function.

##Create the data frame for representation, with Classification, Log fold change and standard deviation
logfold_df <-  graphic_loc1vsloc2_genus %>% select(Classification, location1vslocation2_beta, location1vslocation2_SE)

colnames(logfold_df)<- c("Genus", "LogFold_location1vslocation2", "SD")

##Filter out 0 log fold change and assign name according to the direction of change
logfold_df  <-  logfold_df %>%
  filter(LogFold_location1vslocation2 != 0) %>% arrange(desc(LogFold_location1vslocation2)) %>%
  mutate(group = ifelse(LogFold_location1vslocation2 > 0, "Location2", "Location1"))

logfold_df$Genus = factor(logfold_df$Genus, levels = logfold_df$Genus)

#FILTER WHEN NEEDED. This can be used to filter the results according to log fold change level
range(logfold_df$LogFold_location1vslocation2)
#> [1] -2.926409  3.195890
logfold_df_filtered <- logfold_df[abs(logfold_df$LogFold_location1vslocation2) >= 1.5, ] #If not filtering, plot logfold_df instead
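The filtering step above keeps only taxa whose absolute log fold change reaches 1.5. The same cutoff logic on a toy named vector (values and genus names made up):

```r
# Hypothetical log fold changes for four genera
lfc <- c(GenusA = 3.2, GenusB = 0.4, GenusC = -2.9, GenusD = -1.1)

# Keep taxa whose absolute log fold change reaches the 1.5 cutoff
kept <- lfc[abs(lfc) >= 1.5]

# The sign of the change tells which group the taxon is enriched in
direction <- ifelse(kept > 0, "Location2", "Location1")
kept
```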

Now that we have prepared all data, we can use ggplot for plotting log fold change

waterfall_location1 <-  ggplot(data = logfold_df_filtered,
           aes(x = Genus, y = LogFold_location1vslocation2, fill = group, color = group)) +
  geom_bar(stat = "identity", width = 0.7,
           position = position_dodge(width = 0.4)) +
  geom_errorbar(aes(ymin = LogFold_location1vslocation2 - SD, ymax = LogFold_location1vslocation2 + SD), width = 0.2,
                position = position_dodge(0.05), color = "black") +
  labs(x = NULL, y = "Log fold change",
       title = "Waterfall Plot for the Location Effect") +
  theme_bw() +
  theme(legend.position = "right",
        legend.title =element_blank(),
        plot.title = element_text(hjust = 0.5),
        panel.grid.minor.y = element_blank(),
        axis.text.x = element_text(angle = 60, hjust = 1), plot.margin=margin(10,10,10,100))
waterfall_location1

As mentioned above, we will also produce a graphic with relative abundances, specifically a back-to-back (pyramid) plot.

#Again, bind phylum name to genus name
pyramidal_df <- genus_location_table
pyramidal_df$Classification <- paste(pyramidal_df$Phylum, pyramidal_df$Genus, sep="|")

#Now, transform it to ggplot format
pyramidal_loc1vsloc2 <- pyramidal_df %>% select(c(13, 7:8)) %>% #select columns with genera and abundance
  pivot_longer(!Classification, names_to="Samples", values_to="Abundance")

#Filter pyramidal_loc1vsloc2 to include only significant genera according to ANCOM
pyramidal_loc1vsloc2_filt <- pyramidal_loc1vsloc2[which(pyramidal_loc1vsloc2$Classification %in% logfold_df_filtered$Genus),]

#Split table according to location
loc1pyrm <- pyramidal_loc1vsloc2_filt[which(pyramidal_loc1vsloc2_filt$Samples=="location1"),]
loc2pyrm <- pyramidal_loc1vsloc2_filt[which(pyramidal_loc1vsloc2_filt$Samples=="location2"),]


#GRAPHIC
pyramidalggplot=ggplot(data = pyramidal_loc1vsloc2_filt, mapping= aes(x = Classification, y=Abundance, fill = Samples), colour="white")+
  geom_bar(data=loc1pyrm, stat="identity", mapping=aes(y=-Abundance))+
  geom_bar(data=loc2pyrm, stat="identity")+
  theme_bw()+
  scale_y_continuous(expand=c(0,0), labels=abs, limits=c(-3,3), breaks=seq(-3,3,1))+
  labs(y="Relative abundance (%)")+
  theme(legend.title=element_blank(), axis.title.y=element_blank(), axis.text.y= element_text(size = 8, face="bold"),
        axis.text.x = element_text(size=14, face="bold"), axis.title.x = element_text(size=18, face="bold"), legend.key.size = unit(1.1, "cm"),legend.text = element_text(size = 16), panel.grid.minor.y = element_blank(), panel.grid.major.y = element_blank(),panel.grid.minor.x = element_blank())+
  coord_flip()



pyramidalggplot

Constrained analysis of proximities (CAP)

Microbial communities are greatly influenced by environmental factors. Thus, it is interesting to analyze how these variables relate to community changes. This can be achieved by studying different soil physicochemical parameters and fitting them to an ordination plot. The package vegan includes several functions and a very useful tutorial.

Among the options offered by vegan, we have implemented those that fit the logic of our daily studies. We will exemplify this with some data from pine trees (which can be downloaded from the same link as the sequences).

We will read the phyloseq file (saved as .RDS) and get the soil physicochemical parameters from metadata

#Read phyloseq and get relative abundance
cap_phyloseq <- readRDS(file = "cap_phyloseq")
cap_phyloseq_rel <- transform_sample_counts(cap_phyloseq,function(x){x/sum(x)}*100)

#Get ASV and metadata
ASV_cap<-as.data.frame(t(otu_table(cap_phyloseq_rel)))
metadata_cap <- as.matrix(sample_data(cap_phyloseq))
metadata_cap <- as.data.frame(metadata_cap)

metadata_cap <- metadata_cap[-ncol(metadata_cap)] #Remove column Season_NucleicAcid

Now, we will generate two models: CAP1, which includes all variables as part of the constrained analysis (that is, it explains the variance attributable to all variables), and CAP0, its opposite, i.e., a fully unconstrained model.

These two models are the input for the ordistep function. The function starts from the unconstrained model (CAP0) and, at each step, adds a new variable to the constrained model, followed by a check: if the remaining variables are no longer significant, or add no further explained variance, the function stops. The final output is the formula with the variables that explain most of the variance in the data. In this way, when there are many variables, the model can be reduced by removing those that are interdependent.

#Compute CAP1 and CAP0 models
CAP1<-capscale(ASV_cap~Water+ln_P+SOM+N+ph_H2O+ph_KCl+CN+Na_Exch+Clay+Sand+Slime+Texture, data=metadata_cap,distance = "bray") 

CAP0<-capscale(ASV_cap~1, data=metadata_cap, distance = "bray")

#Get formula with ordistep
CAP_ordi<-ordistep(CAP0, scope=formula(CAP1))
#> 
#> Start: ASV_cap ~ 1 
#> 
#>           Df  AIC      F Pr(>F)   
#> + ph_KCl  11   28 1.2601  0.005 **
#> + ph_H2O  11   30 1.0487  0.320   
#> + Water   15   10 1.0896  0.345   
#> + Clay    15   10 1.0955  0.350   
#> + ln_P     8   33 1.0101  0.390   
#> + Na_Exch 10   33 0.9478  0.710   
#> + SOM     14   23 0.8758  0.840   
#> + N       13   28 0.8611  0.935   
#> + CN      16 -Inf                 
#> + Sand    16 -Inf                 
#> + Slime   16 -Inf                 
#> + Texture 16 -Inf                 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Step: ASV_cap ~ ph_KCl 
#> 
#>          Df    AIC      F Pr(>F)  
#> - ph_KCl 11 28.737 1.2601  0.015 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>           Df  AIC      F Pr(>F)
#> + ph_H2O   3   19 0.9191  0.515
#> + Na_Exch  4    8 1.0277  0.520
#> + ln_P     4   10 0.9176  0.625
#> + Water    5 -Inf              
#> + SOM      5 -Inf              
#> + N        5 -Inf              
#> + CN       5 -Inf              
#> + Clay     5 -Inf              
#> + Sand     5 -Inf              
#> + Slime    5 -Inf              
#> + Texture  5 -Inf
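
The significance of the model selected by ordistep can also be checked with a permutation test, using vegan's anova method for capscale/ordistep objects. A minimal sketch (the number of permutations is just an example):

```r
#Permutation test on the model selected by ordistep
anova(CAP_ordi, permutations = 999)                #overall model significance
anova(CAP_ordi, by = "terms", permutations = 999)  #significance of each term
```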

Thus, we now know which variables influence our microbial community according to ordistep. Although pH measured in H2O gave only a marginal result, we will also include it in our model because we were interested in studying it. With these variables, we can use the function envfit to fit these environmental factors as vectors onto an ordination plot and obtain their correlation with each ordination axis.

ef <- envfit(CAP1 ~ ph_KCl + ph_H2O, data = metadata_cap, perm = 999)
#Now we can adjust p-values
ef.adj <- ef
pvals.adj <- p.adjust(ef$vectors$pvals, method = 'BH')
ef.adj$vectors$pvals <- pvals.adj
ef.adj
#> 
#> ***VECTORS
#> 
#> $pvals
#> numeric(0)
#> 
#> 
#> ***FACTORS:
#> 
#> Centroids:
#>              CAP1    CAP2
#> ph_KCl5.6 -0.5680 -1.0802
#> ph_KCl5.7 -0.6385 -1.0075
#> ph_KCl5.8 -0.4246 -0.1969
#> ph_KCl5.9 -0.5018  0.1034
#> ph_KCl6.0  0.6368  0.3278
#> ph_KCl6.1  0.1524  0.6719
#> ph_KCl6.2 -0.3268  1.0491
#> ph_KCl6.4 -0.0529  1.3443
#> ph_KCl6.5  1.0748  0.1100
#> ph_KCl6.6  0.9464 -0.2098
#> ph_KCl7.0  0.4793 -0.3622
#> ph_KCl7.7  1.6547 -0.9666
#> ph_H2O6.4 -0.0529  1.3443
#> ph_H2O6.5 -0.5446  0.4548
#> ph_H2O6.6 -0.1920 -0.6177
#> ph_H2O6.7 -0.4580  0.6139
#> ph_H2O6.8 -0.3268  1.0491
#> ph_H2O6.9  0.1524  0.6719
#> ph_H2O7.0 -0.6357 -0.3595
#> ph_H2O7.1  1.0748  0.1100
#> ph_H2O7.3  0.6368  0.3278
#> ph_H2O7.7  0.4793 -0.3622
#> ph_H2O7.8  0.9464 -0.2098
#> ph_H2O8.3  1.6547 -0.9666
#> 
#> Goodness of fit:
#>            r2 Pr(>r)  
#> ph_KCl 0.8180  0.079 .
#> ph_H2O 0.8141  0.073 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Permutation: free
#> Number of permutations: 999

With plot_ordination we can get a graphical representation of the ordination and the fitted vectors, which lets us visualize the direction of change of the environmental variables.

#First, get distance matrix
distance_matrix <- phyloseq::distance(physeq = cap_phyloseq_rel, method = "wunifrac")

#Then, generate the ordination with CAP (that is, constrained to ph_KCl and ph_H2O)
CAP_wunifrac <- ordinate(physeq = cap_phyloseq_rel, method = "CAP", distance = distance_matrix,
                         formula = ~ph_KCl+ph_H2O)

#Produce plot
CAP_wunifrac_plot  <- plot_ordination(physeq = cap_phyloseq_rel, ordination = CAP_wunifrac,
                                             color = "Species", axes = c(1,2)) +
  aes(shape = Species) +
  geom_point(aes(colour = Species), alpha = 1, size = 3.5) +
  scale_shape_manual(values=c("Pinaster"=16,"Other"=15),
                     breaks=c("Pinaster","Other"))+
  scale_color_manual(values = c("Pinaster"="brown", "Other"="orange"),
                     name="Host genotype",
                     breaks=c("Pinaster", "Other"),
                     labels=expression(paste("Confirmed ", italic("P. pinaster")), "Pine forest samples"))+
  guides(shape="none")+
  ggtitle("CAP on Weighted Unifrac distance")+
  theme_bw()+
  theme(legend.key=element_blank(),
        legend.title.align = 0.85,
        legend.title = element_text(face="bold"),
        axis.text = element_text(size=14),
        axis.title = element_text(size = 16),
        plot.title = element_text(hjust=0.5, face="bold"),
        legend.text = element_text(size = 16))+
  geom_hline(aes(yintercept = c(0.00)), lty=2, colour="grey")+
  geom_vline(aes(xintercept = c(0.00)), lty=2, colour="grey")

#Define arrows aesthetic and labels
arrowmat <-  vegan::scores(CAP_wunifrac, display = "bp")
arrowdf <-  data.frame(labels = rownames(arrowmat), arrowmat) 
arrow_map  <-  aes(xend = CAP1, yend = CAP2,x = 0, y = 0, shape = NULL, color = NULL, label=labels)
label_map <- aes(x = 1.1 * CAP1, y = 1.1 * CAP2, shape = NULL,color = NULL, label=labels)
arrowhead  <-  arrow(length = unit(0.02, "npc"))

#Introduce them to plot
CAP_wunifrac_plot +
  geom_segment(mapping = arrow_map, size = 1, data = arrowdf, color = "black", arrow = arrowhead) +
  geom_text(mapping = label_map, size = 4,data = arrowdf, show.legend = FALSE)
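
If you want to keep the final figure, ggsave from ggplot2 saves the last displayed plot to disk; a minimal sketch (file name and dimensions are just examples):

```r
#Save the last displayed plot (by default, ggsave uses last_plot())
ggsave("CAP_wunifrac_plot.pdf", width = 10, height = 7)
```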

References

  1. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP (2016). “DADA2: High-resolution sample inference from Illumina amplicon data.” Nature Methods, 13, 581-583. doi: 10.1038/nmeth.3869 (URL: https://doi.org/10.1038/nmeth.3869).

  2. Lin H, Peddada SD (2020). “Analysis of compositions of microbiomes with bias correction.” Nature communications, 11(1), 1–11. doi: 10.1038/s41467-020-17041-7, https://www.nature.com/articles/s41467-020-17041-7.

  3. Weinstein MM, Prem A, Jin M, Tang S, Bhasin JM. FIGARO: An efficient and objective tool for optimizing microbiome rRNA gene trimming parameters. bioRxiv 610394; doi: https://doi.org/10.1101/610394

  4. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17(1):10-12. ISSN 2226-6089. doi: https://doi.org/10.14806/ej.17.1.200.

  5. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30(4):772-780. doi:10.1093/molbev/mst010

  6. Price MN, Dehal PS, Arkin AP. FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS One. 2010;5(3):e9490. Published 2010 Mar 10. doi:10.1371/journal.pone.0009490.

  7. McMurdie PJ, Holmes S. phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. Watson M, editor. PLoS One. 2013;8: e61217. pmid:23630581

  8. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26: 139–140. pmid:19910308

  9. Oksanen J, Blanchet FG, Friendly M, Kindt R, Legendre P, McGlinn D, Minchin PR, O'Hara RB, Simpson GL, Solymos P, Stevens MHH, Szoecs E, Wagner H. vegan: Community Ecology Package. R package version 2.5-2. 2018. https://www.mcglinnlab.org/publication/2019-01-01_oksanen_vegan_2019/